SETL


SETL (pronounced “settle”) is a Scala ETL framework powered by Apache Spark that helps you structure your Spark ETL projects, modularize your data transformation logic and speed up your development.

Features

With SETL, an ETL application can be represented as a Pipeline. A Pipeline contains multiple Stages, and each Stage contains one or more Factories.
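
To make the nesting concrete, here is a minimal plain-Scala sketch of the Pipeline, Stage and Factory hierarchy. The names FactorySpec, StageSketch and PipelineSketch are hypothetical stand-ins invented for illustration, not SETL's own classes, and the sketch assumes that stages execute in the order in which they are listed.

```scala
// Hypothetical stand-ins that only model the nesting described above.
// They are NOT the classes shipped with SETL.
final case class FactorySpec(name: String)
final case class StageSketch(factories: List[FactorySpec])

final case class PipelineSketch(stages: List[StageSketch]) {
  // Assumption: stages run one after another, in declaration order.
  def executionOrder: List[String] =
    stages.flatMap(stage => stage.factories.map(_.name))
}

object PipelineShape {
  // Two stages: the first holds two factories, the second holds one.
  val pipeline: PipelineSketch = PipelineSketch(List(
    StageSketch(List(FactorySpec("LoadOrders"), FactorySpec("LoadCustomers"))),
    StageSketch(List(FactorySpec("JoinOrdersWithCustomers")))
  ))
}
```

Calling PipelineShape.pipeline.executionOrder lists the factory names stage by stage, which is a convenient way to check that a pipeline has the shape you intended.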

The class Factory[T] is an abstraction of a data transformation that produces an object of type T. It has four methods (read, process, write and get) that the developer implements.
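
The four-method contract can be illustrated with a small, dependency-free sketch. SketchFactory below is a hypothetical trait written for this example, not the Factory[T] class from SETL. The fluent this.type return types and the in-memory word count are assumptions chosen to keep the example runnable without Spark.

```scala
// Hypothetical stand-in mirroring the read / process / write / get contract.
// It is NOT io.github.setl's Factory; the signatures here are assumptions.
trait SketchFactory[T] {
  def read(): this.type     // load the input data
  def process(): this.type  // apply the transformation
  def write(): this.type    // persist the result
  def get(): T              // expose the result to later steps
}

final class WordCountFactory(lines: Seq[String])
    extends SketchFactory[Map[String, Int]] {

  private var words: Seq[String] = Seq.empty
  private var counts: Map[String, Int] = Map.empty
  var written: Boolean = false

  override def read(): this.type = {
    // Split every line on whitespace and drop empty tokens.
    words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)
    this
  }

  override def process(): this.type = {
    // Count the occurrences of each word.
    counts = words.groupBy(identity).map { case (w, ws) => w -> ws.size }
    this
  }

  override def write(): this.type = {
    // A real factory would save to a datastore; this sketch only sets a flag.
    written = true
    this
  }

  override def get(): Map[String, Int] = counts
}

object FactorySketch {
  // Run the four steps in order and return the produced value.
  def run(): Map[String, Int] =
    new WordCountFactory(Seq("a b a", "b c")).read().process().write().get()
}
```

Separating the four steps keeps I/O (read, write) apart from the transformation (process), so the transformation can be tested on its own.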

The class SparkRepository[T] is a data access layer abstraction. It can be used to read a Dataset[T] from, or write one to, a datastore. Each repository should be defined in a configuration file, and you can have as many SparkRepositories as you want.

The entry point of a SETL project is the object io.github.setl.Setl, which handles Pipeline and SparkRepository instantiation.

Official website

Tutorial and documentation
