About dbt: dbt is a transformation engine built on code-first software engineering principles — essentially a collection of text files defining what to transform and how. The good thing is that it comes with many features out of the box, such as documentation, lineage, and testing. There are also challenges: it can become very labor-intensive, and sometimes cost-intensive, if traditional modelling techniques are applied without common sense. We aim to leverage the benefits of dbt and hopefully diminish some of the disadvantages.
orchestrate the whole pipeline in one place
trigger dbt Cloud jobs, run dbt against your DWH, or run it within the Keboola platform
store dbt artifacts and run stats in one place
explore dbt docs in one click
have a local developer environment with cloned data in one command
We are supporting three major use cases from day one:
We will be posting more information and videos to help users navigate the new features.
How to get your hands on it?
To make dbt work, we need two new project features (which are cool by themselves): read-only storage and artifacts. All new PAYG projects have these features activated by default. All existing projects can request activation through a support request.
This is a neat script I dug up for one of our users who asked support how to easily trigger snapshots for multiple tables. Please note this could be orchestrated (as a Python transformation) in your weekly/monthly flows, for instance, to create snapshots of critical data that will last longer than the automatic Time Travel functionality.
Please note it is not ideal, since it exposes the token (but you can either create a dedicated one, or reset your personal token after use).
To use this in a Python transformation, all you have to replace is the token and the table list:
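A minimal sketch of such a script, assuming the Keboola Storage API snapshot endpoint (`POST /v2/storage/tables/{table_id}/snapshots`) and the `X-StorageApi-Token` header; the stack URL and table IDs below are placeholders, so verify the endpoint against the Storage API docs for your stack before relying on it:

```python
import urllib.request

KBC_TOKEN = "YOUR-STORAGE-API-TOKEN"   # replace; a dedicated token is safer
STACK_URL = "https://connection.keboola.com"  # use your stack's URL
TABLES = ["in.c-sales.orders", "in.c-sales.customers"]  # replace with your tables

def snapshot_endpoint(base_url: str, table_id: str) -> str:
    """Build the snapshot URL for one table."""
    return f"{base_url}/v2/storage/tables/{table_id}/snapshots"

def create_snapshots(base_url: str, token: str, tables: list) -> None:
    """POST a snapshot request for each table in the list."""
    for table_id in tables:
        req = urllib.request.Request(
            snapshot_endpoint(base_url, table_id),
            method="POST",
            headers={"X-StorageApi-Token": token},
        )
        with urllib.request.urlopen(req) as resp:
            print(f"{table_id}: HTTP {resp.status}")

# Guard so nothing runs until you supply a real token:
if KBC_TOKEN != "YOUR-STORAGE-API-TOKEN":
    create_snapshots(STACK_URL, KBC_TOKEN, TABLES)
```

Dropping this into a Python transformation and scheduling it in a flow gives you the weekly/monthly snapshot behaviour described above.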
This processor enables you to anonymize specified columns of an input table. In the configuration, you specify the method of anonymization plus the tables and the respective columns you wish to anonymize. The specified columns of the specified tables are anonymized; everything else is passed through to out/tables.
An example of the use mentioned in the documentation:
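To make the idea concrete, here is an illustration only — not the processor's actual configuration or implementation — of column-level anonymization: chosen columns of a CSV table are replaced with a truncated SHA-256 hash while the rest passes through unchanged:

```python
import csv
import hashlib
import io

def anonymize(csv_text: str, columns: set) -> str:
    """Hash the values in the given columns; pass all other columns through."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col in columns & set(row):
            row[col] = hashlib.sha256(row[col].encode()).hexdigest()[:16]
        writer.writerow(row)
    return out.getvalue()

table = "id,email\n1,jane@example.com\n"
print(anonymize(table, {"email"}))  # id passes through, email is hashed
```

Hashing (rather than masking) keeps the column joinable across tables while hiding the raw value, which is typically what you want for analytics.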
Note: Don’t concern yourself with advanced authentication types, e.g., OAuth, in the beginning. Start with ‘URL Query’, ‘Basic HTTP’, or ‘Login’, and know where to come back in case another authentication type is required.
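For orientation, the two simplest styles look like this in plain Python — the host, endpoint, and parameter names are made-up placeholders, not any real API:

```python
import base64
import urllib.parse
import urllib.request

# 1) URL Query: the token travels as a query-string parameter.
params = urllib.parse.urlencode({"api_token": "SECRET", "page": 1})
query_req = urllib.request.Request(f"https://api.example.com/v1/items?{params}")

# 2) Basic HTTP: credentials are base64-encoded into the Authorization header.
creds = base64.b64encode(b"user:password").decode()
basic_req = urllib.request.Request(
    "https://api.example.com/v1/items",
    headers={"Authorization": f"Basic {creds}"},
)
```

‘Login’ adds one step on top of these: a first request exchanges credentials for a token, which subsequent requests then send using one of the two patterns above.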
The data catalog represents an overview of data shared to and from the project. The data catalog allows you to share data in a very efficient, controlled and auditable way.
There are several options for how you can share data:
Project Members – To the entire organization. Any user of any project in the organization can link the data bucket.
Organization Members – To administrators of the organization. Any user of any project in the organization can link the data bucket provided that they are also an administrator of the organization.
Selected Projects – To specified projects. Any user of the listed projects in the organization can link the data bucket.
Selected Users – To specified users. Any listed users in the organization can link the data bucket.
Shared catalog details
Creating a new catalog
Subscribing to an existing shared catalog
Keboola Storage writer
This writer loads single or multiple tables from your current project into a different Keboola Connection project. The component can be used in situations where Data Catalog cannot, e.g., moving data between two different organizations or regions.
The extractor uses the source project’s Storage API token to set up a data extraction tunnel between the source project and the destination (current) project. The API token can be limited to buckets, tables, or a single table if needed.
Like the components mentioned above, it requires a Keboola Storage API token, which can be limited as mentioned before. The Storage API supports quick sync and more robust async data load requests, as well as data preview requests, etc. More in the official documentation.
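A quick sketch of the data-preview request mentioned above, assuming the Storage API path `GET /v2/storage/tables/{table_id}/data-preview` and the `X-StorageApi-Token` header; the table ID is a placeholder, and you should double-check the endpoint for your stack in the official documentation:

```python
import urllib.request

def preview_request(base_url: str, table_id: str, token: str, limit: int = 100):
    """Build a data-preview request for quick inspection of a table."""
    url = f"{base_url}/v2/storage/tables/{table_id}/data-preview?limit={limit}"
    return urllib.request.Request(url, headers={"X-StorageApi-Token": token})

req = preview_request("https://connection.keboola.com", "in.c-main.orders", "TOKEN")
# with urllib.request.urlopen(req) as resp:  # the response body is CSV rows
#     print(resp.read().decode())
```

Preview requests are handy for sanity-checking a table before kicking off a heavier async export.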
Since we are bringing feature parity between different stacks (mostly the existing stacks and the pay-as-you-go one), I think it might be beneficial to discuss the new features and publish a bit of a guide on how to do the same (for testing/developing SQL queries in workspaces). Let’s have a look at SQL workspaces now:
A workspace serves several purposes and can be used as
an interactive development environment (IDE) to create transformations.
an analytical workspace where you can interactively perform experiments and modelling with live production data.
an ephemeral workspace created on each run of a transformation to provide the staging area in which the transformation operates. Ephemeral transformation workspaces are not visible in the transformation UI, hence we won’t mention them further.
When a workspace is created, it enters the Active state and can be used.
Database (Snowflake, Redshift, and Synapse) workspaces are billed by the runtime of the queries executed in them. As such, we leave them in the Active state until you delete them.