IATI.cloud Documentation

IATI.cloud




Introduction

IATI.cloud extracts all published IATI XML files from the IATI Registry and stores all data in Apache Solr cores, allowing for fast access.

IATI is a global aid transparency standard that makes information about aid spending easier to access, re-use and understand through a unified open standard. You can find out more about the IATI data standard on the IATI Standard website.

We have recently moved towards a Solr-only version of IATI.cloud. If you are looking for the hybrid IATI.cloud with Django API and Solr API, you can find it under the branch archive/iati-cloud-hybrid-django-solr.

You can install this codebase using Docker. Follow the Docker installation guide for more information.

Setting up, running and using IATI cloud

Setting up and running is split into two parts: Docker and manual. Because of the extensiveness of these sections, they are contained in their own files. We’ve also included a usage guide, as well as a guide to how IATI.cloud processes data. Find them in the corresponding sections of this documentation.

Requirements

Software

| Software | Version (tested and working) | What is it for |
| --- | --- | --- |
| Python | 3.11 | General runtime |
| PostgreSQL | LTS | Django and Celery support |
| RabbitMQ | LTS | Messaging system for Celery |
| MongoDB | LTS | Aggregation support for Direct Indexing |
| Solr | 9.8.1 | Used for indexing IATI Data |
| (optional) Docker | LTS | Running full stack IATI.cloud |
| (optional) NGINX | LTS | Connection |

Hardware

Disk space: Generally, around 600GB of disk space should be available for indexing the entire IATI dataset, especially if the JSON dump fields are active (with .env FCDO_INSTANCE=True). If not, you can get away with around 300GB.

RAM: Around 20GB of RAM has historically proven insufficient, which led us to set the RAM requirement at 40GB. Here is a handy guide to setting up RAM swap.

Local development

For local development, only a limited amount of disk space is required. The complete unindexed IATI dataset is around 10GB, and because you can limit the dataset indexing quite extensively, you can easily trim the size requirement down to less than 20GB.

For local development, Docker and NGINX are not required, but Docker is recommended to avoid system-specific setup issues.

Submodules

We make use of a single submodule, which contains a dump of the Django static files for the administration panel, as well as the IATI.cloud frontend and StreamSaver (used to stream large files to a user). To update the IATI.cloud frontend, create a fresh build from the frontend repository and replace the files in the submodule. Specifically, we include the ./build folder, and copy the ./build/static/css, ./build/static/js and ./build/static/media directories to the static submodule.

To update the Django administrator static files, run Django’s collectstatic and update the files in the submodule.

Lastly, StreamSaver is used to stream large files to the user.

Central python packages

Django is used to host a management interface for the Celery tasks; it was formerly also used to host an API.

celery, combined with flower, django-celery-beat and django-celery-results, is used to manage multitask processing.

psycopg2-binary is used to connect to PostgreSQL.

python-dotenv is used for .env support.

lxml and MechanicalSoup are used for legacy XML document processing.

pysolr, xmljson and pymongo are used to support the direct indexing to Solr process.
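As a quick illustration of how these fit together, here is a minimal sketch (the Solr URL, core name and field names are assumptions for illustration, not the project’s actual code) that adds and queries a single document with pysolr:

```python
# Minimal pysolr sketch: index one flat document into a local "activity" core
# and query it back. URL, core and field names are illustrative assumptions.
import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/activity', always_commit=True)

# Add a single flat document.
solr.add([{'iati_identifier': 'XM-EXAMPLE-1', 'title_narrative': 'Example activity'}])

# Query it back and print the stored fields.
for doc in solr.search('iati_identifier:XM-EXAMPLE-1'):
    print(doc)
```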

Code Management

flake8 is used to maintain code quality in PEP 8 style

isort is used to keep the imports sorted

pre-commit is used to enforce commit message styles of the form:

feat: A new feature
fix: A bug fix
docs: Documentation only changes
style: Changes that do not affect the meaning of the code (white-space, formatting, missing semi-colons, etc)
refactor: A code change that neither fixes a bug nor adds a feature
perf: A code change that improves performance
test: Adding missing or correcting existing tests
chore: Changes to the build process or auxiliary tools and libraries such as documentation generation

Testing

We test with pytest, and use coverage to generate coverage reports. You can use . scripts/cov.sh to quickly run all tests and generate a coverage report. This also conveniently prints the location of the coverage HTML report, which can be viewed from your browser.

Contributing

Can I contribute?

Yes! We are mainly looking for coders to help on the project. If you are a coder, feel free to Fork the repository and send us your amazing Pull Requests!

How should I contribute?

Python already has clear PEP 8 code style guidelines, so it is difficult to add much to them, but there are certain key points to follow when contributing:

Who makes or made use of IATI.cloud?

(Logos of organisations that make use of IATI.cloud) & many others


Branches

Other branches should be prefixed similarly to commits, like docs/added-usage-readme

Index

We provide an index file, which serves as a front-facing page for iati.cloud. Currently, the index is created with pandoc by combining the README and the markdown files in ./docs.

To update the index:

  1. Make sure pandoc is installed: sudo apt-get update && sudo apt-get install -y pandoc
  2. Run bash scripts/update_docs_index.sh from the IATI.cloud root directory. Or do it manually with:
cat README.md ./docs/*.md > ./docs/combined.md
pandoc -s --metadata title="IATI.cloud Documentation" -o ./docs/index.html ./docs/combined.md
rm ./docs/combined.md

Ensure this is pushed to the correct branch, or change the branch on GitHub -> Settings -> Pages.

Installing and running IATI.cloud with Docker


Quick start

Run the setup script from the IATI.cloud root directory

sudo bash scripts/setup.sh

Notes:

Introduction

We want to have a full-stack IATI.cloud application with the above specifications running. This includes Django, RabbitMQ and Celery, along with a PostgreSQL database, MongoDB for aggregation, Apache Solr for document indexing, and lastly NGINX as a web server.

To accomplish this, we have created a docker-compose.yml configuration file, which starts all of the services. Each “cog in the system” is its own runnable docker container.

The services use the default docker compose network. Each service registers itself to the network through its service name, which allows the docker containers to connect to each other. Where locally you would use localhost:5432, a docker container connecting to a PostgreSQL container refers to database:5432. By providing a port mapping like ports: 8000:8000, you allow localhost port 8000 to connect through to the docker container’s port 8000.
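As a hedged illustration of this service-name networking, the following Python sketch connects to PostgreSQL with psycopg2, using localhost outside Docker and the database service name from inside another container. The RUNNING_IN_DOCKER flag and the default credentials are assumptions for illustration only.

```python
# Illustrative sketch: the same connection code targets "localhost" on the host
# machine and the compose service name "database" from inside a container.
# RUNNING_IN_DOCKER is a hypothetical flag, not something IATI.cloud defines.
import os
import psycopg2

host = 'database' if os.getenv('RUNNING_IN_DOCKER') else 'localhost'
conn = psycopg2.connect(
    host=host,
    port=5432,
    dbname=os.getenv('POSTGRES_DB', 'iati_cloud'),
    user=os.getenv('POSTGRES_USER', 'iati_cloud'),
    password=os.getenv('POSTGRES_PASSWORD', ''),
)
print('Connected to PostgreSQL on', host)
conn.close()
```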

.env

Please check out the environment variable reference in the local installation documentation. The variables are the same, with the exception of the host IPs, which are the service names as explained above.

Services

service network name ports image Additional notes
database database 5432 postgres:latest Using the POSTGRES_ fields in .env to set up and access. POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB, self-explanatory default values for the user, password and database name. We mount /var/lib/postgresql/data to our db_data docker ‘volume’, which is persisted, meaning the container can be stopped and started without losing data.
rabbitmq rabbitmq 5672,15672 rabbitmq:latest We mount /var/lib/rabbitmq to our rabbitmq_data docker ‘volume’, which is persisted (as above).
mongo mongo 27017 mongo:latest Accessed through mongodb://USER:PASS@mongo:27017 where USER and PASS are set in the MONGO_INITDB_ fields in .env. We mount /data/db to our mongo_data docker ‘volume’, which is persisted (as above).
solr solr 8983 bitnami/solr:9.1.1 Using bitnami instead of default solr because of the env options. We’re mounting the /bitnami directory to either the solr_data docker volume, or a local directory through the environment variable SOLR_VOLUME, which allows us to manipulate the core configuration. We pass SOLR_CORES with a list of all our cores. We pass SOLR_OPTS containing memory options. We’re using SOLR_ADMIN_USERNAME and *_PASSWORD to use authentication.
iaticloud iaticloud 8000 . (local Dockerfile) We build a Docker image with our IATI.cloud codebase. This image installs the requirements, Java 11 (for the Solr post tool), and runs the entrypoint. The entrypoint waits for the depended services to be fully started, then checks if this is the initial run of the IATI.cloud container. If not, it sets up the static files, sets up the database and sets up the superuser with the DJANGO_SUPERUSER_* .env variables.
celeryworker none ports . (local Dockerfile) This runs on the iaticloud docker image. It runs the main celery workers with N concurrency, where N is the number of cores in the available CPU.
celeryrevokeworker none ports . (local Dockerfile) This runs on the iaticloud docker image. It runs a single celery worker named Revoke to cancel all tasks
celeryscheduler none ports . (local Dockerfile) This runs on the iaticloud docker image. It runs celery beat
celeryflower none 5555 . (local Dockerfile) This runs on the iaticloud docker image. It runs celery flower task management interface, uses the password and username from CELERYFLOWER_ prefixed .env fields
nginx nginx 80 ./services/nginx Runs NGINX and enables the flower and datastore subdomains for a provided domain. For local development it also allows subdomains. Customize SOLR_AUTH_ENCODED and IC_DOMAIN. iati.cloud-redirect is available but not enabled by default. The docker image is described in more detail here.

Setup

We recommend using the ./scripts/setup.sh script to get everything set up for you, then running sudo docker compose up -d to start the required processes.

The following is a description of all the steps required to set up IATI.cloud through docker:

If you are looking for manual installation steps, follow the chain of functions in the scripts starting at setup.sh, or read up on the scripts in the available scripts documentation.

Docker usage

Running basics

This assumes the setup has been done.

Start the entire stack with:

sudo docker compose up -d

Stopping the docker containers:

sudo docker compose down

Restarting

sudo docker compose down
sudo docker compose up -d

After docker service changes

For example, after updating the version of PostgreSQL.

sudo docker compose build <SERVICE_NAME>

After changing solr configuration

For example, after changing the sector code from a pdouble to a pint in the budget core’s managed-schema file.

sudo docker compose up -d solr
sudo direct_indexing/solr/update_solr_cores.sh
sudo docker compose down

Note: this should only be done on empty cores; otherwise, Solr might be unable to start the updated core. Use the clear_all_cores task in the Django admin.

Removing built docker images

sudo docker images
sudo docker image rm <ID>

Connecting to live docker containers

sudo docker exec -it <SERVICE_NAME> /bin/bash

Connecting to docker logs

sudo docker logs <SERVICE_NAME>

or, to get live updating logs

sudo docker logs <SERVICE_NAME> -f

Other notes


Usage

Check out the Usage guide for your next steps.

Installing IATI.cloud locally with all dependencies


Introduction

The following is split up into two sections. The first is an installation guide for the services that are required for IATI.cloud, like Python and Solr. However, we have historically seen that installations differ across systems nearly every time, so this guide is not considered complete. Use it as a guideline rather than a step-by-step guide. Of course, Google is your friend, and installation guides can be found for most if not all systems.

The second part is a setup guide, which explains which steps to take to get your IATI.cloud instance up and running.

Alternatively, you can use docker locally as well. Read the Docker installation and setup guide.

Installation of dependencies

Install python

sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update 
sudo apt install python3.11

Install PostgreSQL

sudo apt-get install postgresql
sudo systemctl enable postgresql.service

Install MongoDB

sudo apt-get install gnupg
wget -qO - https://www.mongodb.org/static/pgp/server-6.0.asc | sudo apt-key add -
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu focal/mongodb-org/6.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-6.0.list
sudo apt-get update
sudo apt-get install -y mongodb-org
sudo systemctl enable mongod.service

Install RabbitMQ

sudo apt-get install -y erlang
sudo apt-get install rabbitmq-server
sudo systemctl enable rabbitmq-server.service

Alternative RabbitMQ installation guide for Ubuntu Linux

Install Solr

# Install java
sudo apt-get update
sudo apt-get install openjdk-11-jdk openjdk-11-jre
sudo tee -a /etc/environment <<EOL
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
JRE_HOME=/usr/lib/jvm/java-11-openjdk-amd64/jre
EOL
# Install solr
cd /opt
wget https://archive.apache.org/dist/solr/solr/9.8.1/solr-9.8.1.tgz
tar xzf solr-9.8.1.tgz solr-9.8.1/bin/install_solr_service.sh --strip-components=2
sudo bash ./install_solr_service.sh solr-9.8.1.tgz 

# Create required solr cores
sudo su - solr -c "/opt/solr/bin/solr create -c activity -n data_driven_schema_configs"
sudo su - solr -c "/opt/solr/bin/solr create -c budget -n data_driven_schema_configs"
sudo su - solr -c "/opt/solr/bin/solr create -c dataset -n data_driven_schema_configs"
sudo su - solr -c "/opt/solr/bin/solr create -c organisation -n data_driven_schema_configs"
sudo su - solr -c "/opt/solr/bin/solr create -c publisher -n data_driven_schema_configs"
sudo su - solr -c "/opt/solr/bin/solr create -c result -n data_driven_schema_configs"
sudo su - solr -c "/opt/solr/bin/solr create -c transaction -n data_driven_schema_configs"
sudo su - solr -c "/opt/solr/bin/solr create -c draft_activity -n data_driven_schema_configs"
sudo su - solr -c "/opt/solr/bin/solr create -c draft_budget -n data_driven_schema_configs"
sudo su - solr -c "/opt/solr/bin/solr create -c draft_result -n data_driven_schema_configs"
sudo su - solr -c "/opt/solr/bin/solr create -c draft_transaction -n data_driven_schema_configs"

sudo cp ./direct_indexing/solr/cores/activity/managed-schema /var/solr/data/activity/conf/managed-schema.xml
sudo cp ./direct_indexing/solr/cores/budget/managed-schema /var/solr/data/budget/conf/managed-schema.xml
sudo cp ./direct_indexing/solr/cores/dataset/managed-schema /var/solr/data/dataset/conf/managed-schema.xml
sudo cp ./direct_indexing/solr/cores/organisation/managed-schema /var/solr/data/organisation/conf/managed-schema.xml
sudo cp ./direct_indexing/solr/cores/publisher/managed-schema /var/solr/data/publisher/conf/managed-schema.xml
sudo cp ./direct_indexing/solr/cores/result/managed-schema /var/solr/data/result/conf/managed-schema.xml
sudo cp ./direct_indexing/solr/cores/transaction/managed-schema /var/solr/data/transaction/conf/managed-schema.xml
sudo cp -r ./direct_indexing/solr/cores/activity/xslt /var/solr/data/activity/conf/
sudo cp ./direct_indexing/solr/cores/activity/managed-schema /var/solr/data/draft_activity/conf/managed-schema.xml
sudo cp ./direct_indexing/solr/cores/budget/managed-schema /var/solr/data/draft_budget/conf/managed-schema.xml
sudo cp ./direct_indexing/solr/cores/result/managed-schema /var/solr/data/draft_result/conf/managed-schema.xml
sudo cp ./direct_indexing/solr/cores/transaction/managed-schema /var/solr/data/draft_transaction/conf/managed-schema.xml

sudo sed -i 's/<int name="maxFields">1000<\/int>/<int name="maxFields">2000<\/int>/' /var/solr/data/activity/conf/solrconfig.xml
sudo sed -i 's/<int name="maxFields">1000<\/int>/<int name="maxFields">2000<\/int>/' /var/solr/data/budget/conf/solrconfig.xml
sudo sed -i 's/<int name="maxFields">1000<\/int>/<int name="maxFields">2000<\/int>/' /var/solr/data/result/conf/solrconfig.xml
sudo sed -i 's/<int name="maxFields">1000<\/int>/<int name="maxFields">2000<\/int>/' /var/solr/data/transaction/conf/solrconfig.xml
sudo sed -i 's/<int name="maxFields">1000<\/int>/<int name="maxFields">2000<\/int>/' /var/solr/data/draft_activity/conf/solrconfig.xml
sudo sed -i 's/<int name="maxFields">1000<\/int>/<int name="maxFields">2000<\/int>/' /var/solr/data/draft_budget/conf/solrconfig.xml
sudo sed -i 's/<int name="maxFields">1000<\/int>/<int name="maxFields">2000<\/int>/' /var/solr/data/draft_result/conf/solrconfig.xml
sudo sed -i 's/<int name="maxFields">1000<\/int>/<int name="maxFields">2000<\/int>/' /var/solr/data/draft_transaction/conf/solrconfig.xml

Then, edit /opt/solr/bin/solr (for example with nano /opt/solr/bin/solr) and add SOLR_JAVA_MEM="-Xms20g -Xmx20g", or however much memory you choose to assign. Then edit /opt/solr/server/etc/jetty.xml and change LINE 71 from <Set name="requestHeaderSize"><Property name="solr.jetty.request.header.size" default="8192" /></Set> to <Set name="requestHeaderSize"><Property name="solr.jetty.request.header.size" default="65535" /></Set>

And restart Solr

sudo service solr restart

Setup

Dependencies

Ensure the following services are running:

.env

Make sure to set up your local .env file; we’ve provided an example under .env.example.local. The following is a table of fields in the .env file, their function, and whether or not to change them.

Field name Subsystem Functionality Changeable (No/Optional/Must)
SECRET_KEY Django Secret key Must
DEBUG Django Impacts django settings Optional: change on production to False
FRESH Direct Indexing Determines if a new dataset is downloaded Optional
THROTTLE_DATASET Direct Indexing Reduces the number of datasets indexed, can be used to have a fast local run of the indexing process. Optional: False in production
DJANGO_STATIC_ROOT Django Determines where Django static files are served Optional: for local development
DJANGO_STATIC_URL Django Determines where Django static files are served Optional: for local development
POSTGRES_HOST Postgres Host ip Optional
POSTGRES_PORT Postgres Host port Optional
POSTGRES_DB Postgres Initial db name Optional
POSTGRES_USER Postgres Root user name Must
POSTGRES_PASSWORD Postgres Root user pass Must
CELERY_BROKER_URL Celery Connection to the message broker like RabbitMQ. Form: amqp://<RABBITMQ HOST IP> Optional: Depends on your broker
FCDO_INSTANCE Direct Indexing Enables additional indexing features such as GBP conversion and JSON dump fields Optional: enable on FCDO instances
SOLR_ADMIN_USERNAME Solr Admin username Must
SOLR_ADMIN_PASSWORD Solr Admin password Must
SOLR_BASE_URL Solr The connection string from python to solr. (Substitute ports if necessary.) Form with auth: http://<SOLR_ADMIN_USERNAME>:<SOLR_ADMIN_PASSWORD>@<SOLR HOST IP>:8983/solr, or without: http://<SOLR HOST IP>:8983/solr Optional: If authentication is enabled
SOLR_AUTH_ENCODED NGINX A Base64 encoding of <SOLR_ADMIN_USERNAME>:<SOLR_ADMIN_PASSWORD>. We use base64encode.org. Must
MEM_SOLR_MIN Solr The minimum available Solr memory Optional
MEM_SOLR_MAX Solr The maximum available Solr memory Optional
SOLR_VOLUME Solr Either the ‘docker volume’ solr_data, or a local mount directory, like SOLR_VOLUME="/my/storage/iati.cloud/direct_indexing/solr_mount_dir" Optional
CELERYFLOWER_USER Celery Flower access Must
CELERYFLOWER_PASSWORD Celery Flower access Must
DJANGO_SUPERUSER_USERNAME Django Initial superuser account Must
DJANGO_SUPERUSER_PASSWORD Django Initial superuser account Must
DJANGO_SUPERUSER_EMAIL Django Initial superuser account Must
MONGO_INITDB_ROOT_USERNAME MongoDB Initial superuser account Must
MONGO_INITDB_ROOT_PASSWORD MongoDB Initial superuser account Must
MONGO_INITDB_DATABASE MongoDB Default MongoDB database. This is not changeable, but is required for the initialisation of MongoDB in fresh starts in docker. No, this must remain activities
MONGO_CONNECTION_STRING MongoDB mongodb://<MONGO_INITDB_ROOT_USERNAME>:<MONGO_INITDB_ROOT_PASSWORD>@<MONGO HOST IP>:27017 Must
IC_DOMAIN NGINX The domain on which the current IATI.cloud setup is deployed, localhost in development, iati.cloud in production Optional, in production with domain pointed at the server
CSRF_TRUSTED_ORIGINS Django Django trusted origins, like https://iati.cloud for iati.cloud. “A list of trusted origins for unsafe requests (e.g. POST).” Must
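As a small sketch of how these values might be consumed (this is not the project’s settings module), python-dotenv reads the .env file, and SOLR_AUTH_ENCODED can be generated locally instead of via base64encode.org:

```python
# Illustrative sketch: load .env values with python-dotenv and build the
# base64-encoded Solr credentials used for SOLR_AUTH_ENCODED.
import base64
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory

solr_user = os.getenv('SOLR_ADMIN_USERNAME', 'admin')
solr_pass = os.getenv('SOLR_ADMIN_PASSWORD', '')
auth = base64.b64encode(f'{solr_user}:{solr_pass}'.encode()).decode()
print(f'SOLR_AUTH_ENCODED={auth}')
```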

Python

Create a virtual environment

python3.11 -m venv ./env

Activate environment

source ./env/bin/activate

Upgrade pip

pip install --upgrade pip

PostgreSQL

Create a PostgreSQL database with name, username and password (example default values below)

sudo -u postgres psql

create database iati_cloud;
create user iati_cloud with encrypted password 'oipa';
grant all privileges on database iati_cloud to iati_cloud;

Run the initial migration

python manage.py migrate

Create a Django Admin user

python manage.py createsuperuser

Enter a username and a password. Emails are not required but feel free to use yours.

Preload the legacy currency conversion with data

python manage.py loaddata ./services/iaticloud/data_preload/legacy_currency_convert_dump.json

Running IATI.cloud manually

Run the django server:

python manage.py runserver

Run celery workers:

celery -A iaticloud worker -l INFO --concurrency=32 -n worker@%h

Optionally run celery revoke queue:

celery -A iaticloud worker -l INFO -n revoke@%h -Q revoke_queue

Optionally run celery aida workers

celery -A iaticloud worker -l INFO --concurrency=4 -n aida@%h -Q aida_queue

Run celery beat

celery -A iaticloud beat -l INFO

Run celery flower

celery -A iaticloud flower -l INFO --port=5555

Usage

Check out the Usage guide for your next steps.

IATI.cloud dataset processing


Introduction

The following is an explanation of the dataset processing flow for IATI.cloud.

Process overview

We use the code4iati dataset metadata and publisher metadata dumps to access all of the available metadata.

Publisher: We index the publisher metadata immediately, as it is flat data.

Dataset: We download the code4iati dataset dump to access all of the available IATI datasets from the IATI Registry. If update is true, we check whether the hash has changed compared to the already indexed datasets. We then loop over the datasets within the dataset metadata dump and trigger subtask_process_dataset for each. For every dataset we:

  1. Clean the dataset metadata (extracting the nested resources and extras).
  2. Retrieve the filepath of the downloaded dataset based on the organisation name and dataset name.
  3. Check that the version is valid (in this case version 2).
  4. Get the type of the file from the metadata or the file content itself.
  5. Check the dataset validation.
  6. Clear the existing data for this dataset if it is found in IATI.cloud and the update flag is True.
  7. Trigger the indexing of the actual dataset.
  8. Store the success state of the indexing in iati_cloud_indexed and index the entire dataset metadata.

Indexing the dataset

First, we parse the IATI XML dataset. We then convert it to a dict using the BadgerFish algorithm.

We apply our cleaning and add custom fields. We then dump the dataset dict into a JSON file. Lastly, we extract the subtypes (budget, result and transaction).
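A minimal sketch of the conversion step, using a tiny IATI-like snippet rather than a full downloaded dataset:

```python
# Illustrative BadgerFish conversion of a small XML snippet with xmljson and lxml.
# In the BadgerFish output, attributes end up under "@..." keys and text under "$".
import json

from lxml import etree
from xmljson import badgerfish as bf

xml = b"""<iati-activities>
  <iati-activity xml:lang="en">
    <iati-identifier>XM-EXAMPLE-1</iati-identifier>
  </iati-activity>
</iati-activities>"""

dataset = bf.data(etree.fromstring(xml))
print(json.dumps(dataset, indent=2))
# The xml:lang attribute appears as "@{http://www.w3.org/XML/1998/namespace}lang",
# which is exactly what the cleaning step below renames to "lang".
```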

Cleaning

We then recursively clean the dataset. @ values are removed, @{http://www.w3.org/XML/1998/namespace}lang is replaced with lang, and key-value fields are extracted. View the dataset cleaning implementation.
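The following is a simplified sketch of that kind of recursive cleaning; the actual rules live in the linked implementation and cover more cases than shown here:

```python
# Simplified recursive cleaning sketch: rename the namespaced lang attribute,
# drop the BadgerFish "@" attribute markers and unwrap text-only "$" nodes.
XML_LANG = '@{http://www.w3.org/XML/1998/namespace}lang'

def clean(node):
    if isinstance(node, dict):
        if set(node) == {'$'}:          # text-only node: unwrap the value
            return clean(node['$'])
        cleaned = {}
        for key, value in node.items():
            if key == XML_LANG:
                key = 'lang'
            elif key.startswith('@'):
                key = key[1:]           # drop the attribute marker
            cleaned[key] = clean(value)
        return cleaned
    if isinstance(node, list):
        return [clean(item) for item in node]
    return node
```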

Adding custom fields

We have several “custom fields” that we enrich the IATI data with.

View detailed custom fields implementation

Extracting subtypes

We extract the subtypes to single valued fields. View the activity subtypes extraction implementation.

Each of these is indexed separately into its respective core.
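An illustrative sketch of what that extraction can look like; the field names and the shape of the activity dict are assumptions, not the actual implementation:

```python
# Illustrative subtype extraction: pull budgets, results and transactions out of
# an activity dict so they can be indexed as separate documents, each carrying
# a reference back to the parent activity.
def extract_subtypes(activity):
    subtypes = {}
    for key in ('budget', 'result', 'transaction'):
        items = activity.get(key, [])
        if isinstance(items, dict):  # a single occurrence is not wrapped in a list
            items = [items]
        subtypes[key] = [
            dict(item, iati_identifier=activity.get('iati-identifier'))
            for item in items
        ]
    return subtypes
```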

Final step

Lastly, if the previous steps were all successful, we index the IATI activity data.

Development entry points

There are many scripts available. The following is a table displaying their function. For intricate details, use the -h or --help flag when running the script with bash, or simply open the scripts and read them.

Script name Category Function Sudo (root access) required
build.sh Docker core Simple trigger for docker compose build Yes
clear_celery_queues.sh Utility Clear all items in all celery queues Yes
cov.sh Development Run the tests written for the ./direct_indexing module No
download_fcdo.sh Utility Based on requests by FCDO, re-downloads FCDO datasets No
restart.sh Development Restart the docker services based on the python code, to immediately utilise the latest code as written locally. Yes
select_env.sh Utility Activates the desired environment in case of having multiple environments present. No
setup.sh Setup Main setup script, triggers subscripts after asking if they should be triggered Yes
setup/install_cockpit Setup Installs cockpit Yes
setup/install_docker Setup Installs docker Yes
setup/install_nginx Setup Installs NGINX and Certbot, optionally triggers nginx and certbot setups. Yes
setup/install_submodules Setup Inits and updates the git submodule, copies the static directory for the Django admin panel No
setup/setup_environment Setup Creates .env files, symlinks the selected one, requests information such as usernames and passwords and updates the .env files No
setup/setup_nginx Setup Updates the machine’s Nginx configuration with the required information Yes
setup/setup_solr_mount_dir Setup Creates the solr_data directory where the user wants to mount their solr data. Yes
setup/setup_solr Setup Creates and triggers the configuration of the Solr docker image Yes
setup/setup_ssl Setup Sets up SSL certificates for the Nginx configuration Yes
setup/setup_swap Setup Sets up swap space Yes
start.sh Docker core Starts specified services Yes
stop.sh Docker core Stops specified services Yes
update_solr_cores.sh Utility Updates the solr cores with updated configuration Yes
util.sh Utility Contains utility functions for use across scripts directory, never accessed directly as it has no function No

Using IATI.cloud


Introduction

This file contains a guide to using IATI.cloud as an administrator, from zero to a fully indexed IATI dataset, as well as some tips on querying data.

Administration

The IATI.cloud process is managed from Django. In the Django admin interface you can trigger the ‘Periodic tasks’, which execute things like clearing all of the Solr cores, indexing the entire IATI dataset, or indexing subsets; more about this under Task management below.

The Django Administration interface, as seen in appendix 1, contains some user management, celery results, legacy currency convert and periodic tasks.

Celery results

The django celery results page is similar to the Celery Flower interface: it shows all of the dispatched tasks and their states, and results can be read here as well. In the Celery Flower interface you can also terminate running tasks, in case of necessity. These interfaces should be used to inspect tasks and task results.

Legacy currency convert

This feature was initially developed for IATI.cloud to enable currency conversion. It is a best-effort approach that uses the International Monetary Fund’s (IMF) monthly exchange rates, expressed as SDR per currency unit, to extract as many conversion data points as possible. Values found in IATI datasets are converted at their exact value-date, meaning the conversion is applied at the moment of the value’s activity rather than “now”, resulting in a more accurate conversion of the value.
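A worked sketch of the idea, using made-up rates rather than real IMF data:

```python
# Illustrative conversion through SDR: with monthly "SDR per currency unit"
# rates, the value is converted at the month of its value-date.
SDR_PER_UNIT = {            # hypothetical rates for 2023-05, not real IMF data
    ('2023-05', 'USD'): 0.74,
    ('2023-05', 'GBP'): 0.93,
}

def convert(amount, from_currency, to_currency, value_date):
    month = value_date[:7]  # "2023-05-17" -> "2023-05"
    in_sdr = amount * SDR_PER_UNIT[(month, from_currency)]
    return in_sdr / SDR_PER_UNIT[(month, to_currency)]

print(round(convert(1000, 'USD', 'GBP', '2023-05-17'), 2))  # roughly 795.7
```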

Task management

To manage tasks in IATI.cloud, you will want to go to host:8000/admin/django_celery_beat/periodictask/, or simply https://iati.cloud/admin/django_celery_beat/periodictask on a live environment, substituting iati.cloud with your domain.

If the following core tasks do not exist, create them. The core tasks are:

If the final task is enabled, every 3 hours, IATI.cloud will update to contain the latest found IATI data files.
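If you prefer to create the recurring indexing task programmatically rather than through the admin forms, a hedged sketch using django-celery-beat (run from the Django shell; the schedule and argument values are examples, see the tasks overview below) could look like this:

```python
# Illustrative sketch: register the indexing task on a three-hourly crontab
# schedule with django-celery-beat, instead of creating it in the admin UI.
import json

from django_celery_beat.models import CrontabSchedule, PeriodicTask

schedule, _ = CrontabSchedule.objects.get_or_create(
    minute='0', hour='*/3', day_of_week='*', day_of_month='*', month_of_year='*',
)
PeriodicTask.objects.get_or_create(
    name='Start IATI.cloud indexing',
    task='direct_indexing.tasks.start',
    crontab=schedule,
    kwargs=json.dumps({'update': True, 'drop': True}),  # example argument values
)
```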

Tasks overview

Task Interface name Functionality Setup and arguments
celery.backend_cleanup celery.backend_cleanup Cleans up Celery backend Automatic setup, every day on a crontab schedule
legacy_currency_convert.tasks.update_exchange_rates Update the exchange rates Updates the exchange rates using legacy currency convert Automatic setup, every day on a crontab schedule
legacy_currency_convert.tasks.dump_exchange_rates Dump exchange rates Creates a JSON file for the direct indexing process This is a subtask which is used by the system, not necessary as a runnable task
direct_indexing.metadata.dataset.subtask_process_dataset Process dataset metadata Starts the indexing of a provided dataset, updates the existing dataset in Solr if necessary This is a subtask which is used by the system, not necessary as a runnable task.
arguments:
- dataset: a dataset metadata dict
- update: a boolean flag whether or not to update the dataset.
direct_indexing.tasks.aida_async_drop AIDA Drop data Remove provided (draft) data, possible through Django admin, but meant to be used by the Django url host:8000/aida/drop arguments:
-ds_name: literal name of the dataset.
-draft: (optional), if 1, the data is dropped from the draft core.
direct_indexing.tasks.aida_async_index AIDA Index data Index provided (draft) data, possible through Django admin, but meant to be used by the Django url host:8000/aida/index arguments:
-dataset: Json IATI registry dataset metadata.
-publisher: name of the publisher.
-ds_name: name of the dataset.
-ds_url: url of the dataset to be downloaded.
-draft: (optional), if 1, the data is dropped from the draft core.
direct_indexing.tasks.clear_all_cores Clear all cores Removes all of the data from all of the endpoints Manual setup, every second and tick the one-off task checkbox.
direct_indexing.tasks.fcdo_replace_partial_url FCDO Replace partial url matches Used to update a dataset based on the provided URL. For example, if an existing dataset has the url ‘example.com/a.xml’, and a staging dataset is prepared at ‘staging-example.com/a.xml’, the file is downloaded and the iati datastore is refreshed with the new content for this file.

Note: if the setting “FRESH” is active, and the datastore is incrementally updating, the custom dataset will be overwritten by the incremental update. If this feature is used, either disable the incremental updates (admin panel), or set the Fresh setting to false (source code).
Manual setup, every second and tick the one-off task checkbox.
arguments:
- find_url: the url to be replaced
- replace_url: the new url
direct_indexing.tasks.revoke_all_tasks Revoke all tasks Cancels every task that is currently queued (does not cancel tasks currently being executed by Celery Workers). Manual setup, every second and tick the one-off task checkbox.
direct_indexing.tasks.start Start IATI.cloud indexing Triggers an update for the IATI.cloud, downloads the latest metadata and dataset dump, and processes it. Manual setup, every second and tick the one-off task checkbox.
Alternatively, this can be set up on a crontab schedule every three (3) hours, as the dataset dump updates every three hours (note: remove the one-off task tick)
arguments:
- update: a boolean flag which indicates if the IATI.cloud should be updated. If True, the existing activities are updated, if False, drops all the data from the solr cores and does a complete re-index.
- drop: a boolean flag which indicates whether or not older datasets (no longer available in this indexing cycle) should be removed from IATI.cloud.
direct_indexing.tasks.subtask_dataset_metadata Dataset metadata subtask Processes and indexes dataset metadata. This process also triggers a dataset indexing task for every dataset metadata dict. This is a subtask which is used by the system, not necessary as a runnable task
direct_indexing.tasks.subtask_publisher_metadata Publisher metadata subtask Processes and indexes publisher metadata This is a subtask which is used by the system, not necessary as a runnable task
direct_indexing.tasks.index_custom_dataset Manually index a dataset with an URL Manually indexes the provided dataset. The user needs to provide a URL, dataset title, dataset name (no spaces, for example fcdo_set-13 or finland_mfa-001), and organisation name. Manual setup, every second and tick the one-off task checkbox.

arguments:
- url: the string of the XML Dataset URL.
- title: A fancy title for the dataset.
- name: A no-space dataset name.
- org: The organisation name.
direct_indexing.tasks.remove_custom_dataset Manually remove a custom dataset Removes the provided custom indexed dataset. Manual setup, every second and tick the one-off task checkbox.

arguments:
- dataset_id: The id of the dataset to be removed, can be found in the dataset core.
- name: A no-space dataset name.
- org: The organisation name.

Querying data

IATI.cloud data can be accessed through its endpoints:

| Core | IATI.cloud endpoint | Available fields | IATI Reference | Requires authentication header |
| --- | --- | --- | --- | --- |
| Activity | datastore.iati.cloud/api/v2/activity/?q=*:* | activity managed-schema | IATI Activity | No |
| Budget | datastore.iati.cloud/api/v2/budget/?q=*:* | budget managed-schema | IATI Budget | No |
| Dataset | datastore.iati.cloud/api/v2/dataset/?q=*:* | dataset managed-schema | Dataset metadata | No |
| Organisation | datastore.iati.cloud/api/v2/organisation/?q=*:* | organisation managed-schema | IATI Organisation | No |
| Publisher | datastore.iati.cloud/api/v2/publisher/?q=*:* | publisher managed-schema | Publisher metadata | No |
| Result | datastore.iati.cloud/api/v2/result/?q=*:* | result managed-schema | IATI Result | No |
| Transaction | datastore.iati.cloud/api/v2/transaction/?q=*:* | transaction managed-schema | IATI Transaction | No |
| AIDA Draft Activity | datastore.iati.cloud/api/v2/draft_activity/?q=*:* | activity managed-schema | IATI Activity | Yes, base64 encoded solr user:pass |
| AIDA Draft Budget | datastore.iati.cloud/api/v2/draft_budget/?q=*:* | budget managed-schema | IATI Budget | Yes, base64 encoded solr user:pass |
| AIDA Draft Dataset | datastore.iati.cloud/api/v2/draft_dataset/?q=*:* | dataset managed-schema | IATI Dataset | Yes, base64 encoded solr user:pass |
| AIDA Draft Result | datastore.iati.cloud/api/v2/draft_result/?q=*:* | result managed-schema | IATI Result | Yes, base64 encoded solr user:pass |
| AIDA Draft Transaction | datastore.iati.cloud/api/v2/draft_transaction/?q=*:* | transaction managed-schema | IATI Transaction | Yes, base64 encoded solr user:pass |
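As a small query sketch with the Python requests library (the field names in fl and the filter in q are assumptions and depend on the managed-schema of the core you query):

```python
# Illustrative Solr query against the public activity endpoint.
import requests

resp = requests.get(
    'https://datastore.iati.cloud/api/v2/activity/',
    params={
        'q': 'recipient_country_code:KE',         # standard Lucene query syntax
        'fl': 'iati_identifier,title_narrative',  # assumed field names
        'rows': 10,
        'wt': 'json',
    },
    timeout=30,
)
for doc in resp.json()['response']['docs']:
    print(doc)
```

For the draft cores, add a basic Authorization header containing the base64-encoded Solr user:pass, as noted in the table above.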

Querying tips

Here are some tips on how to write effective queries for Solr:


Appendices

1. Django admin interface

Django admin interface showing navigation menu with Groups, Users, Celery Results, Exchange rates, Crontabs, Intervals, Periodic tasks, and Clocked options

2. Django Celery task results

Django Celery task results page displaying a table of task executions with columns for task name, arguments, status, date created, and date done

3. Celery Flower

Main interface: Celery Flower main interface showing task monitoring dashboard with active workers, task statistics, and real-time task execution graphs

Specific task result: Celery Flower task result detail page showing individual task execution information including task ID, state, arguments, result, and traceback