Objective
This document aims to equip the technical staff with the framework and best practices for supporting activities involving PetroAI data management. These activities primarily concern the preparation and ingestion of data into PetroAI as source data for the build pipeline, and the export of data back into the client’s systems to support analysis, client-specific dashboards, and integration into other aspects of client workflows.
Suggested Tools
There are many database administration and migration tools. For clients who don’t already have a preferred tool, we recommend DBeaver: it is the tool PetroAI uses internally, we have found it to be highly performant, and it is the tool we can most readily support.
Data Ingestion
The objective of the Data Ingestion process is to deposit client data into the PetroAI source database such that the data can be used by the PetroAI build pipeline to create PetroAI Data Products.
Data Ingestion consists of two high-level processes:
- Prepare client data – this includes gathering client data, transforming and standardizing data into the form required by PetroAI (e.g., converting units), and finally restructuring data into the PetroAI source database schema.
- Upload client data – this involves transferring the prepared client data into the PetroAI source database. This may involve pre-steps such as clearing existing data in the source database. It also may involve accommodating for transfers between different database types (e.g., Microsoft SQL to MySQL).
Stage 1: Preparing Data
As mentioned, there are three steps in this stage: gathering data, transforming data, and restructuring data.
Throughout this stage, we refer to a staging area: a dedicated directory (shared as necessary) where all data will be initially gathered. The staging area is hosted by the client, either locally or in their preferred cloud environment. It can be useful to have data initially available in CSV/TSV/JSON formats so that scripts can easily be run against the files to gather, transform, and link the data as appropriate.
Step 1: Gathering Data
In this step, all data required to populate the PetroAI source database is gathered into the staging area. These data may come from disparate sources and databases. As such, at the end of this step, much of the data will likely be present in the staging area as CSV/TSV/JSON files that aren’t necessarily structured in a way that reflects how they’ll be linked to one another.
Automation tip: as much as possible, it’s ideal to write scripts that will download the relevant data files. When first collecting this data, this may be overly burdensome as the internal teams are figuring out where data is or how it should be exported. But over time, codifying these gathering steps in code can decrease effort and errors in the data collection process.
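As a rough illustration, the sketch below gathers data into the staging area in two ways: copying CSV exports from a shared drive and dumping tables from an internal database. All paths, connection strings, and table names are placeholders and would be replaced with the client’s actual sources.

```python
"""Gather source data into the staging area.

All paths, connection details, and table names below are hypothetical
placeholders for illustration only.
"""
from pathlib import Path
import shutil

import pandas as pd
from sqlalchemy import create_engine

STAGING = Path("/data/petroai_staging")        # hypothetical staging directory
NETWORK_SHARE = Path("/mnt/geology_exports")   # hypothetical file share


def gather_flat_files() -> None:
    """Copy the latest CSV exports from a shared drive into the staging area."""
    STAGING.mkdir(parents=True, exist_ok=True)
    for src in NETWORK_SHARE.glob("*.csv"):
        shutil.copy2(src, STAGING / src.name)


def gather_from_database() -> None:
    """Dump the tables we need from an internal database to CSVs in the staging area."""
    engine = create_engine(
        "mssql+pyodbc://user:pass@corp-db/production_db?driver=ODBC+Driver+17+for+SQL+Server"
    )
    for table in ["well_headers", "monthly_production"]:  # hypothetical table names
        pd.read_sql_table(table, engine).to_csv(STAGING / f"{table}.csv", index=False)


if __name__ == "__main__":
    gather_flat_files()
    gather_from_database()
```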
Step 2: Transforming Data
In this step, data fields are corrected/adjusted to agree with the PetroAI source database schema.
By the end of this step, every field in the data schema guide should be present somewhere in files in the staging area.
Automation tip: as in step 1, wherever possible, writing scripts to perform the transformations on the raw data files from step 1 will speed up future data ingestion activities.
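For example, a transformation script might rename columns and standardize units so that a gathered file agrees with the data schema guide. The column names, unit conversion, and file names below are illustrative only; the schema guide defines the actual field names and units.

```python
"""Transform a gathered file so that field names and units match the schema.

Column names, units, and file names are hypothetical examples.
"""
from pathlib import Path

import pandas as pd

STAGING = Path("/data/petroai_staging")  # hypothetical staging directory


def transform_production(in_name: str = "monthly_production.csv",
                         out_name: str = "monthly_production_transformed.csv") -> None:
    df = pd.read_csv(STAGING / in_name)

    # Rename client columns to the (hypothetical) names used in the schema guide.
    df = df.rename(columns={
        "WELL_ID": "well_id",
        "OIL_BBL": "oil_volume",
        "DEPTH_FT": "measured_depth_ft",
    })

    # Example unit standardization: convert measured depth from feet to metres.
    df["measured_depth"] = df["measured_depth_ft"] * 0.3048
    df = df.drop(columns=["measured_depth_ft"])

    # Normalize dates to ISO format so that every source file agrees.
    df["production_month"] = pd.to_datetime(df["production_month"]).dt.strftime("%Y-%m-%d")

    df.to_csv(STAGING / out_name, index=False)


if __name__ == "__main__":
    transform_production()
```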
Automation tip: often data errors that break PetroAI builds can be caught at this transformation stage. Writing scripts that check for valid/reasonable data values in different fields can be a time-saving strategy.
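A minimal validation sketch along these lines is shown below; the field names, required columns, and checks are hypothetical and should be tuned to the client’s data and the schema guide.

```python
"""Basic sanity checks on a transformed file before restructuring and upload.

Field names and rules below are illustrative placeholders.
"""
from pathlib import Path

import pandas as pd

STAGING = Path("/data/petroai_staging")  # hypothetical staging directory


def validate_production(name: str = "monthly_production_transformed.csv") -> list:
    df = pd.read_csv(STAGING / name)
    problems = []

    # Required fields must be present and non-null.
    for col in ["well_id", "production_month", "oil_volume"]:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif df[col].isna().any():
            problems.append(f"null values in column: {col}")

    # Volumes should never be negative.
    if "oil_volume" in df.columns and (df["oil_volume"] < 0).any():
        problems.append("negative oil_volume values")

    # Duplicate (well, month) rows usually indicate a bad join upstream.
    keys = ["well_id", "production_month"]
    if set(keys) <= set(df.columns) and df.duplicated(subset=keys).any():
        problems.append("duplicate well_id/production_month rows")

    return problems


if __name__ == "__main__":
    issues = validate_production()
    if issues:
        raise SystemExit("Validation failed:\n  " + "\n  ".join(issues))
    print("Validation passed")
```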
Step 3: Restructuring Data
In this step, scripts are run to gather data from the transformed data files (from step 2) into files (or tables in a staging database) that comply with the PetroAI Source Database schema.
While not required, we advise clients to create and populate a staging database that contains all restructured data in exactly the PetroAI destination schema. Putting data into a staging database makes it easier to catch invalid data values prior to the more time-intensive data upload process (Stage 2).
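Assuming the staging database is MySQL and its tables have already been created from the PetroAI source schema DDL, a restructuring script might look like the sketch below; the connection string and the file-to-table mapping are placeholders.

```python
"""Load restructured files into a client-hosted staging database whose tables
mirror the PetroAI source schema. Connection details, file names, and table
names are hypothetical placeholders.
"""
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

STAGING = Path("/data/petroai_staging")
engine = create_engine("mysql+pymysql://user:pass@staging-host/petroai_staging")  # hypothetical

# Map each restructured file to the schema table it should populate.
FILE_TO_TABLE = {
    "wells.csv": "wells",
    "monthly_production_transformed.csv": "production",
}

for file_name, table in FILE_TO_TABLE.items():
    df = pd.read_csv(STAGING / file_name)
    # if_exists="append" assumes the tables already exist with the correct schema;
    # invalid types or constraint violations surface here, before the upload stage.
    df.to_sql(table, engine, if_exists="append", index=False)
```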
Stage 2: Uploading Data
In this stage, the prepared data from Stage 1 is loaded into the PetroAI Source Database.
Prior to transferring the data, it is important to attend to any data that may already be in the PetroAI Source Database. Depending on the nature of the Data Ingestion and what data is already present, the client may choose to delete this data before uploading new data or leave it in place, augmenting it with new data being copied in.
Depending on the configurations involved, copying can be done in a number of ways: a database administration tool (e.g., DBeaver), command line tools (e.g., the MySQL CLI), or custom-written scripts (e.g., Python).
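As one possible approach, the sketch below copies prepared tables from a staging database into the PetroAI Source Database, with an optional truncate step for the clear-before-upload case; reading and writing in chunks through pandas also accommodates transfers between different database types. Connection strings and table names are placeholders.

```python
"""Copy prepared tables from a staging database into the PetroAI Source Database.

Connection strings and table names are hypothetical placeholders.
"""
import pandas as pd
from sqlalchemy import create_engine, text

staging = create_engine("mysql+pymysql://user:pass@staging-host/petroai_staging")  # hypothetical
petroai = create_engine("mysql+pymysql://user:pass@petroai-host/petroai_source")   # hypothetical

TABLES = ["wells", "production"]
CLEAR_EXISTING = True  # set to False to augment existing data instead of replacing it

for table in TABLES:
    if CLEAR_EXISTING:
        # Remove existing rows so the upload fully replaces the table's contents.
        with petroai.begin() as conn:
            conn.execute(text(f"TRUNCATE TABLE {table}"))

    # Chunked read/write keeps memory bounded and also works when the source and
    # destination are different database types (e.g., Microsoft SQL to MySQL).
    for chunk in pd.read_sql_table(table, staging, chunksize=50_000):
        chunk.to_sql(table, petroai, if_exists="append", index=False)
```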
Data Export
Data export involves taking data out of a specific build’s PetroAI Data Product database and storing it on-premises for integration into the client’s work processes (e.g., ad-hoc analysis, dashboarding, or as part of a workflow).
Exports will typically take one of two formats.
- Data files. In this case, the goal is to extract the data and store it in a set of CSV/TSV files.
- Database. In this case, the data is extracted and stored in a client-hosted database.
In either case, the primary design question is the structure of this data. There are two options here:
- A basic export would involve exporting individual tables from the Data Product database.
- A custom export involves using one or more SQL queries to extract columns (possibly from a variety of tables) to create an export that is structurally different from what is in the Data Product database.
Either of these cases can be achieved using a database administration tool (e.g., DBeaver), a command line tool (e.g., the MySQL CLI), or a custom script. When first developing exports, using a database administration tool is often the best place to start in order to allow rapid iteration on the export and easy debugging.
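For instance, a small export script might combine both styles: straight table dumps plus a custom query that joins columns from several tables. The database name, table names, and query below are hypothetical and would be adapted to the actual Data Product schema.

```python
"""Export data from a build's Data Product database to CSV files.

The database name, table names, and query are hypothetical placeholders.
"""
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, text

products = create_engine("mysql+pymysql://user:pass@petroai-host/petroai_products")  # hypothetical
EXPORT_DIR = Path("/data/petroai_exports")
EXPORT_DIR.mkdir(parents=True, exist_ok=True)

# Basic export: one CSV per Data Product table.
for table in ["forecasts", "type_curves"]:  # hypothetical table names
    pd.read_sql_table(table, products).to_csv(EXPORT_DIR / f"{table}.csv", index=False)

# Custom export: a query that joins columns from several tables into one file.
custom_query = text("""
    SELECT w.well_id, w.operator, f.forecast_month, f.oil_rate
    FROM wells w
    JOIN forecasts f ON f.well_id = w.well_id
""")
pd.read_sql_query(custom_query, products).to_csv(EXPORT_DIR / "well_forecasts.csv", index=False)
```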