Aller au contenu

Prise en main de Databases Analytics Tools

Databases Analytics Tools est un micro-framework Python open source pour l'analyse de données. DB Analytics Tools est construit sur Psycopg2, Pyodbc, Pandas, Matplotlib et Scikit-learn. Il aide les analystes de données à interagir avec les entrepôts de données en tant que clients de bases de données traditionnelles.

En 2023 alors que je travaillais à Togocom, j'ai créé une classe Python pour facilter l'exécution des routines sur des périodes de plusieurs jours. J'ai ensuite fait évoluer l'idée pour aboutir à Databases Analytics Tools.

Pourquoi utiliser DB Analytics Tools ?

  • Easy to learn : It is high level API and doesn't require any special effort to learn.
  • Real problems solver : It is designed to solve real life problems of the Data Analyst
  • All in One : Support queries, Data Integration, Analysis, Visualization and Machine Learning

Principales composantes

# Component Description How to import
0 db Database Interactions (Client) import db_analytics_tools as db
1 dbi Data Integration & Data Engineering import db_analytics_tools.integration as dbi
2 dba Data Analysis import db_analytics_tools.analytics as dba
3 dbviz Data Visualization import db_analytics_tools.plotting as dbviz
4 dbml Machine Learning & MLOps import db_analytics_tools.learning as dbml

Installer DB Analytics Tools

DB Analytics Tools est bâti sur les packages suivants :

  • Python
  • Psycopg2
  • Pyodbc
  • Pandas
  • SQLAlchemy
  • Streamlit

DB Analytics Tools peut être installé avec pip

Install DB Analytics Tools
pip install db-analytics-tools

Get Started

Connexion à la base de données

Comme les clients de bases de données traditionnels, nous devons fournir l'adresse IP, le port et les informations d'identification du serveur de base de données. DB Analytics Tools prend en charge Postgres et SQL Server.

Setup client
# Import DB Analytics Tools
import db_analytics_tools as db

# Database Infos & Credentials
ENGINE = "postgres"
HOST = "localhost"
PORT = "5432"
DATABASE = "postgres"
USER = "postgres"
PASSWORD = "admin"

# Setup client
client = db.Client(host=HOST, 
                   port=PORT, 
                   database=DATABASE, 
                   username=USER, 
                   password=PASSWORD, 
                   engine=ENGINE)

DB Analytics Tools offres les mêmes fonctionnalités que les clients traditionnels ie, exécuter tous les types de requêtes à savoir

  • Data Definition Language (client.execute)
  • Data Control Language (client.execute)
  • Data Manipulation Language (client.execute)
  • Data Query Language (client.read_sql)

Data Definition Language

Create table
query = """
----- CREATE TABLE -----
drop table if exists public.transactions;
create table public.transactions (
    transaction_id integer primary key,
    client_id integer,
    product_name varchar(255),
    product_category varchar(255),
    quantity integer,
    unitary_price numeric,
    amount numeric
);
"""

client.execute(query=query)

Data Manipulation Language

Update table
query = """
----- POPULATE TABLE -----
insert into public.transactions (transaction_id, client_id, product_name, product_category, quantity, unitary_price, amount)
values
    (1,101,'Product A','Category 1',5,100,500),
    (2,102,'Product B','Category 2',3,50,150),
    (3,103,'Product C','Category 1',2,200,400),
    (4,102,'Product A','Category 1',7,100,700),
    (5,105,'Product B','Category 2',4,50,200),
    (6,101,'Product C','Category 1',1,200,200),
    (7,104,'Product A','Category 1',6,100,600),
    (8,103,'Product B','Category 2',2,50,100),
    (9,103,'Product C','Category 1',8,200,1600),
    (10,105,'Product A','Category 1',3,100,300);
"""

client.execute(query=query)

Data Query Language

Get data
query = """
----- GET DATA -----
select *
from public.transactions
order by transaction_id;
"""

dataframe = client.read_sql(query=query)
print(dataframe.head())
Output
   transaction_id  client_id product_name product_category  quantity  unitary_price  amount
0               1        101    Product A       Category 1         5          100.0   500.0
1               2        102    Product B       Category 2         3           50.0   150.0
2               3        103    Product C       Category 1         2          200.0   400.0
3               4        102    Product A       Category 1         7          100.0   700.0
4               5        105    Product B       Category 2         4           50.0   200.0

Implement SQL based ETL

ETL API is in the integration module db_analytics_tools.integration. Let's import it ans create an ETL object.

Setup ETL
# Import Integration module
import db_analytics_tools.integration as dbi

# Setup ETL
etl = dbi.ETL(client=client)

ETLs for DB Analytics Tools consists in functions with date parameters. Everything is done in one place i.e on the database. So first create a function on the database like this :

Create a procedure (function)
query = """
----- CREATE FUNCTION ON DB -----
create or replace function public.fn_test(rundt date) returns integer
language plpgsql
as
$$
begin
    --- DEBUG MESSAGE ---
    raise notice 'rundt : %', rundt;

    --- EXTRACT ---

    --- TRANSFORM ---

    --- LOAD ---

    return 0;
end;
$$;
"""

client.execute(query=query)

Exécuter une fonction

Then ETL function can easily be run using the ETL class via the method ETL.run()

Run function
# ETL Function
FUNCTION = "public.fn_test"

## Dates to run
START = "2023-08-01"
STOP = "2023-08-05"

# Run ETL
etl.run(function=FUNCTION, 
        start_date=START, 
        stop_date=STOP, 
        freq="d", 
        reverse=False)
Output
Function    : public.fn_test
Date Range  : From 2023-08-01 to 2023-08-05
Iterations  : 5
[Runing Date: 2023-08-01] [Function: public.fn_test] Execution time: 0:00:00.122600
[Runing Date: 2023-08-02] [Function: public.fn_test] Execution time: 0:00:00.049324
[Runing Date: 2023-08-03] [Function: public.fn_test] Execution time: 0:00:00.049409
[Runing Date: 2023-08-04] [Function: public.fn_test] Execution time: 0:00:00.050019
[Runing Date: 2023-08-05] [Function: public.fn_test] Execution time: 0:00:00.108267

Exécuter plusieurs fonctions

Most of time, several ETL must be run and DB Analytics Tools supports running functions as pipelines.

Run several functions
## ETL Functions
FUNCTIONS = [
    "public.fn_test",
    "public.fn_test_long",
    "public.fn_test_very_long"
]

## Dates to run
START = "2023-08-01"
STOP = "2023-08-05"

# Run ETLs
etl.run_multiple(functions=FUNCTIONS, 
                 start_date=START, 
                 stop_date=STOP, 
                 freq="d", 
                 reverse=False)
Output
Functions   : ['public.fn_test', 'public.fn_test_long', 'public.fn_test_very_long']
Date Range  : From 2023-08-01 to 2023-08-05
Iterations  : 5
*********************************************************************************************
[Runing Date: 2023-08-01] [Function: public.fn_test..........] Execution time: 0:00:00.110408
[Runing Date: 2023-08-01] [Function: public.fn_test_long.....] Execution time: 0:00:00.112078
[Runing Date: 2023-08-01] [Function: public.fn_test_very_long] Execution time: 0:00:00.092423
*********************************************************************************************
[Runing Date: 2023-08-02] [Function: public.fn_test..........] Execution time: 0:00:00.111153
[Runing Date: 2023-08-02] [Function: public.fn_test_long.....] Execution time: 0:00:00.111395
[Runing Date: 2023-08-02] [Function: public.fn_test_very_long] Execution time: 0:00:00.110814
*********************************************************************************************
[Runing Date: 2023-08-03] [Function: public.fn_test..........] Execution time: 0:00:00.111044
[Runing Date: 2023-08-03] [Function: public.fn_test_long.....] Execution time: 0:00:00.123229
[Runing Date: 2023-08-03] [Function: public.fn_test_very_long] Execution time: 0:00:00.078432
*********************************************************************************************
[Runing Date: 2023-08-04] [Function: public.fn_test..........] Execution time: 0:00:00.127839
[Runing Date: 2023-08-04] [Function: public.fn_test_long.....] Execution time: 0:00:00.111339
[Runing Date: 2023-08-04] [Function: public.fn_test_very_long] Execution time: 0:00:00.140669
*********************************************************************************************
[Runing Date: 2023-08-05] [Function: public.fn_test..........] Execution time: 0:00:00.138380
[Runing Date: 2023-08-05] [Function: public.fn_test_long.....] Execution time: 0:00:00.111157
[Runing Date: 2023-08-05] [Function: public.fn_test_very_long] Execution time: 0:00:00.077731
*********************************************************************************************


Pour en savoir plus sur Databases Analytics Tools, merci de consulter la documentation officielle https://joekakone.github.io/db-analytics-tools.