
Friday, December 20, 2024

What is Trino? Basic Trino Installation and Setup

 Trino is a distributed SQL query engine designed for running fast and interactive analytical queries against various data sources. Originally developed at Facebook as Presto, Trino was later forked to create an independent project focused on advancing the platform's capabilities.

Key Features of Trino:

  1. Distributed Architecture:

    • Trino is designed to scale out across multiple nodes, enabling parallel execution of queries for high performance on large datasets.
  2. Multi-Source Querying:

    • It allows you to query data from multiple heterogeneous data sources, such as relational databases (MySQL, PostgreSQL), data lakes (AWS S3, HDFS), NoSQL stores, and more, without the need to move or replicate data.
  3. ANSI SQL Compliance:

    • Trino supports a broad range of SQL standards, making it suitable for complex analytical queries and seamless integration with tools that rely on SQL.
  4. Connector-Based Architecture:

    • Trino has connectors for various data sources. These connectors abstract the underlying data source details, allowing users to query them as if they were a single database (see the short example after this list).
  5. In-Memory Execution:

    • Queries are executed primarily in memory for fast response times, making Trino suitable for interactive analysis and business intelligence.
  6. Open Source:

    • Trino is an open-source project under the Apache License 2.0, with an active community of contributors and users.
  7. Pluggable Extensions:

    • Users can extend Trino’s capabilities by developing custom connectors or using built-in extensions.
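
For example, once a few catalogs are configured, every source is addressed with the same catalog.schema.table naming and plain SQL. The catalog, schema, and table names below are purely illustrative:

-- List the configured catalogs (one per data source)
SHOW CATALOGS;

-- Query a table in a hypothetical mysql_sales catalog as if it were a local table
SELECT order_id, order_total
FROM mysql_sales.shop.orders
LIMIT 10;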

Common Use Cases:

  1. Data Lake Querying:

    • Analyze data stored in distributed object stores like Amazon S3, Google Cloud Storage, or HDFS.
  2. ETL and Data Transformation:

    • Perform lightweight ETL tasks and transform data across disparate data sources.
  3. Data Virtualization:

    • Query multiple data sources as if they were a single database, enabling unified analytics without moving data.
  4. Business Intelligence and Analytics:

    • Integrates with tools like Tableau, Power BI, and Looker for fast, interactive querying of large datasets.
  5. Ad-Hoc Querying:

    • Quickly execute queries on large-scale datasets for data exploration and decision-making.

Popular Data Sources Trino Supports:

  • Relational Databases: MySQL, PostgreSQL, SQL Server, Oracle
  • Data Lakes: Amazon S3, Azure Blob Storage, HDFS
  • Table Formats and Warehouses: Apache Hive, Apache Iceberg, Delta Lake
  • NoSQL Databases: Apache Cassandra, MongoDB, Elasticsearch
  • Specialized Systems: Kafka, Redis, Google BigQuery

Trino is widely adopted in modern data engineering and analytics ecosystems.

Big Data Ecosystem

If you have worked on the Hadoop big data platform, you may be familiar with Hive and Sqoop, with data distributed across HDFS (Hadoop Distributed File System). When Hadoop first arrived, you had to write MapReduce code in Java. Because DBAs are most familiar with SQL, Hive and Sqoop emerged so that data could be processed with queries written in a SQL-like language. Hive internally converts a SQL query into Java MapReduce jobs, and it provides SQL-like joins, custom functions, and so on. Sqoop was used for ETL-style work such as exporting data from an RDBMS into HDFS, or vice versa.

Feature | Hive | Trino
Type | Data warehouse infrastructure built on top of Hadoop. | Distributed SQL query engine for querying diverse data sources.
Purpose | Batch processing and ETL. | Interactive, low-latency analytics and data virtualization.
Origin | Developed by Facebook for large-scale batch processing with Hadoop. | Fork of Presto, originally developed by Facebook, now maintained as Trino.

Performance


Feature | Hive | Trino
Latency | High latency; designed for batch processing, not interactive queries. | Low latency; optimized for fast, interactive querying.
Execution Engine | Uses MapReduce (deprecated), Tez, or Spark for query execution. | Executes queries in-memory, using a distributed architecture.
Optimization | Supports query optimization with ORC/Parquet formats and CBO (Cost-Based Optimization). | Uses advanced query optimizations and parallelism for high performance.

Data Sources


Feature | Hive | Trino
Data Source Support | Designed primarily for Hadoop-based storage (HDFS, S3, etc.). | Supports a wide range of data sources including HDFS, RDBMS, NoSQL, and more.
Connectors | Limited connectors outside of the Hadoop ecosystem. | Extensive set of connectors for relational databases, NoSQL stores, object stores, etc.

A Trino SQL query is blazing fast compared to the same query in Hive. Trino runs queries in memory (with an option to spill to disk), so the same query typically finishes far faster in Trino than in Hive.

Trino has connectors for many RDBMSs, Hive, Delta Lake, and many other data sources, and you can write a single Trino query that joins tables from Oracle, MySQL, S3, and so on. You can join an Oracle table with a MySQL table and a PostgreSQL table using simple catalog definitions and standard SQL syntax.
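
As a rough sketch of such a federated query (the catalog, schema, and table names are made up for illustration and assume an oracle_test catalog, a mysql_test catalog, and an S3-backed hive catalog have been configured):

-- Join an Oracle table, a MySQL table, and data on S3 in a single query
SELECT o.customer_id,
       c.customer_name,
       s.total_spend
FROM oracle_test.sales.orders o
JOIN mysql_test.crm.customers c ON o.customer_id = c.customer_id
JOIN hive.analytics.customer_spend s ON o.customer_id = s.customer_id;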

Trino Architecture

Trino has mainly two types of nodes:
  • Coordinator (1)
  • Workers (1 to many)
For testing purposes it is possible to install and configure the coordinator and a worker on the same node.

1. Coordinator

  • The Coordinator is the central brain of the Trino cluster.
  • Responsibilities:
    • Parsing and Planning:
      • Parses SQL queries and generates an optimized execution plan.
    • Task Scheduling:
      • Assigns tasks to worker nodes and monitors their progress.
    • Metadata Management:
      • Interfaces with the Hive Metastore (or other metadata services) to fetch schema and table information.
    • Query Optimization:
      • Performs cost-based optimizations and determines the most efficient query execution plan.
  • Characteristics:
    • Single instance in a cluster.
    • Does not process data itself but orchestrates the execution of queries (see the EXPLAIN example after this section).

2. Workers

  • Workers are the nodes that execute the actual query tasks.
  • Responsibilities:
    • Process data by reading from data sources, filtering, aggregating, and joining datasets.
    • Perform query execution in parallel.
    • Exchange intermediate results with other workers as required (e.g., during a distributed join or aggregation).
  • Characteristics:
    • Highly scalable; you can add more workers to handle larger workloads.
    • Communicate directly with data sources and with each other for intermediate results.
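
One simple way to see this division of labor is EXPLAIN: the coordinator produces a distributed plan made of stages that are then shipped to the workers. A minimal sketch, assuming a mysql_test catalog with a customers table (illustrative names, like the catalog created later in this post):

-- Show the distributed plan the coordinator will hand out to workers
EXPLAIN (TYPE DISTRIBUTED)
SELECT region, count(*)
FROM mysql_test.crm.customers
GROUP BY region;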

Hive Metastore or AWS Glue Catalog

Trino does not strictly depend on the Hive Metastore, but it often leverages it for metadata management, especially when querying data in systems like Hadoop, HDFS, or data lakes using file formats such as ORC and Parquet.


When Trino Uses the Hive Metastore

  • Metadata Management:
    • The Hive Metastore provides schema information (e.g., table definitions, column types, and partitioning details) for datasets stored in data lakes (HDFS, S3, etc.).
  • Query Optimization:
    • Trino uses metadata from the Hive Metastore for optimizations like pruning partitions or determining file locations.

You do need a metastore-backed catalog in some cases, e.g. if you want to read data from an RDBMS and then write the result to S3.
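
As a rough illustration of why the metastore matters (schema, table, and column names below are hypothetical, and a metastore-backed hive catalog like the one set up later in this post is assumed): the metastore holds the table and partition definitions, so a filter on the partition column lets Trino prune which S3/HDFS paths it actually reads.

-- A partitioned table in a hive catalog; its definition lives in the metastore
CREATE TABLE hive.my_schema.events (
    event_id   bigint,
    event_type varchar,
    event_date date
)
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['event_date']
);

-- The filter on the partition column lets Trino skip all other partitions
SELECT count(*)
FROM hive.my_schema.events
WHERE event_date = DATE '2024-12-01';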

Please read this blog for the steps to install the Hive Standalone Metastore and Trino. It is a fairly complex topic and involves many steps.
The blog article lays out the steps; although it is a few years old, the concepts are still the same.
The latest Trino version, 467, requires JDK 23, so modify the steps to install JDK 23.


Trino Installation

Step 1: Prerequisites

Ensure Java 23 is installed. Check whether Java is installed:

Trino version 467 requires JDK-23

java -version

  1. If it's not installed, do the following:

Ubuntu:

If openjdk-23 is not available in your default repository:


sudo add-apt-repository ppa:openjdk-r/ppa

sudo apt update

sudo apt install openjdk-23-jdk


CentOS/RHEL:

sudo yum install java-23-openjdk


  2. Memory Requirements: Make sure your system has at least 4 GB of RAM.


Step 2: Download Trino 467 (or whatever latest version is available)

Download the Trino 467 tarball:

Create trino user

sudo adduser trino

sudo su - trino

Assuming your home folder is /home/trino


wget https://repo1.maven.org/maven2/io/trino/trino-server/467/trino-server-467.tar.gz



Step 3: Extract the Archive

Extract the downloaded tarball:


tar -xvzf trino-server-467.tar.gz

## You can create a soft link trino-server which points to trino-server-467

ln -s trino-server-467 trino-server

cd trino-server-467    ## or: cd trino-server



Step 4: Configure Trino

The etc folder keeps all the configuration files. You can create the etc directory inside the install, or keep the config files outside of the trino-server install so that when you upgrade the Trino version you don't have to copy config files and can instead just create a soft link. To keep things simple at this point, let's create the etc folder inside trino-server-467:

mkdir etc


mkdir -p /home/trino/data ## it will be used later


  1. Add the following configuration files:

node.properties (defines the node settings):


## node.id will change from one node to another and should be unique.

## You can install the uuid package to generate a unique id, e.g. sudo apt install -y uuid

## node.data-dir: the location (filesystem path) of the data directory. Trino stores logs and other data here.



node.environment=test

node.id=unique-node-id

node.data-dir=/home/trino/data


jvm.config (configures JVM options):
## Change Xmx based on your memory; usually set the JVM heap to about 70% of the node's memory


-verbose:class

-server

-Xmx4G

-XX:+UseG1GC

-XX:G1HeapRegionSize=32M

-XX:+UseGCOverheadLimit

-XX:+ExplicitGCInvokesConcurrent

-XX:+HeapDumpOnOutOfMemoryError

-XX:+ExitOnOutOfMemoryError

-Djdk.attach.allowAttachSelf=true

-Dcom.sun.management.jmxremote.rmi.port=9081




config.properties (basic Trino server settings):


The following is a minimal configuration for the coordinator:

coordinator=true

node-scheduler.include-coordinator=false

http-server.http.port=8080

discovery.uri=http://example.net:8080

And this is a minimal configuration for the workers:

coordinator=false

http-server.http.port=8080

discovery.uri=http://your-trino-host:8080

Alternatively, if you are setting up a single machine for testing, that functions as both a coordinator and worker, use this configuration:

coordinator=true

node-scheduler.include-coordinator=true

http-server.http.port=8080

discovery.uri=http://your-trino-host:8080



Add log.properties with this content

io.trino=INFO

This is what a basic etc folder looks like. Also create a catalog folder inside it; this folder will hold all connector properties files.

mkdir -p etc/catalog

etc$ ls -ltr

total 44

-rw-r--r-- 1 trino trino   14 Nov 27 05:56 log.properties

-rwxr-xr-x 1 trino trino   85 Dec 15 06:23 trino-env.sh

-rw-r--r-- 1 trino trino  264 Dec 16 03:33 jvm.config

drwxr-xr-x 2 trino trino 4096 Dec 16 03:43 catalog

-rw-r--r-- 1 trino trino  185 Dec 17 20:38 config.properties

-rw-r--r-- 1 trino trino  189 Dec 21 06:40 node.properties

cat trino-env.sh 

export JAVA_HOME=/usr/lib/jvm/java-23-openjdk-amd64

export PATH=$JAVA_HOME/bin:$PATH

Step 5: Start Trino

Start the server using the included launcher script:

bin/launcher start


To check the status:

bin/launcher status


To stop the server:

bin/launcher stop


Log files are located at /home/trino/data/var/log/server.log, where /home/trino/data is the path you set in the node.data-dir property in node.properties.



Step 6: Verify Installation

  1. Open your web browser and navigate to: http://trino-host:8080

  2. Use the Trino Web UI to monitor the Trino server. It shows all nodes, queries being executed or finished, etc.

Your log file will display SERVER STARTED:

2024-12-21T06:45:11.169Z INFO main io.trino.eventlistener.EventListenerManager -- Loaded event listener /home/trino/data/etc/mysql-event-listener.properties --
2024-12-21T06:45:11.249Z INFO main io.trino.server.Server Server startup completed in 18.22s
2024-12-21T06:45:11.249Z INFO main io.trino.server.Server ======== SERVER STARTED ========
  

Download Trino Command Line Client

e.g.
wget https://repo1.maven.org/maven2/io/trino/trino-cli/467/trino-cli-467-executable.jar

After downloading, you can rename the file trino-cli-467-executable.jar to trino and copy or move it to /usr/local/bin
e.g. mv trino-cli-467-executable.jar /usr/local/bin/trino
chmod 755 /usr/local/bin/trino

To connect to the Trino server from the client:
If you are running on the same host where you installed the Trino server,
you can simply do

trino
trino> show catalogs ;

If you are running the client from any other host:
trino --server http://your-trino-host:8080
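
Once connected, a quick way to confirm that the coordinator and all workers have registered is to query Trino's built-in system catalog:

trino> select node_id, http_uri, node_version, coordinator, state from system.runtime.nodes ;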


Create Catalog

Create Catalog for MySQL Database


CREATE CATALOG mysql_test USING mysql

WITH (

  "connection-url" = 'jdbc:mysql://my-sql-end-point:3306',

  "connection-user" = 'mysql_user',

  "connection-password" = 'mysql_password'

);


## Now SHOW CATALOGS will include the catalog named mysql_test
##
trino> show schemas from mysql_test ;
## This shows all schemas/databases in the catalog mysql_test
## Show all tables from catalog mysql_test and schema schema_name
## (schema name and database name are used interchangeably)
trino> show tables from mysql_test.schema_name ;
## Select from one table, showing only 10 rows on screen
trino> select * from mysql_test.schema_name.table_name limit 10 ;

## Trino uses catalog_name.schema_name.table_name syntax

Create Catalog for SQL Server Database

CREATE CATALOG sqlserver_test USING sqlserver
WITH (
  "connection-url" = 'jdbc:sqlserver://sql-server-host:1433;databaseName=db_name;encrypt=false',
  "connection-user" = 'db_user',
  "connection-password" = 'db_password'
);

You can join tables from different catalogs, from the same catalog, or from the same catalog but different databases, etc.


Join tables from two different catalogs

select ...
from mysql_test.db_name.table_name as A, sqlserver_test.db_name.table_name as B
where A.column = B.column

Storing Resultset to S3

## Before we can do that, we need a metastore to store table metadata; for that you can either create a Hive Standalone Metastore or use AWS Glue

## Assuming you have created the Hive metastore and it is configured to use S3

## See the doc Querying S3 Object Stores with Trino for configuring the Hive Metastore and S3 setup

CREATE SCHEMA hive.my_schema WITH (location = 's3://s3_bucket/data/');

We are all familiar with the CTAS (Create Table As Select) method,
e.g.
Create Table S3Table As select ..... ;
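
A minimal end-to-end sketch of the same CTAS pattern, assuming the hive.my_schema schema created above and a hypothetical source table in the mysql_test catalog:

-- Read from MySQL and write the result set to S3 as a new Hive-catalog table
CREATE TABLE hive.my_schema.orders_copy
WITH (format = 'PARQUET')
AS
SELECT order_id, customer_id, order_total
FROM mysql_test.shop.orders;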

