
Friday, December 20, 2024

What is Trino? Basic Trino Installation and Setup

 Trino is a distributed SQL query engine designed for running fast and interactive analytical queries against various data sources. Originally developed at Facebook as Presto, Trino was later forked to create an independent project focused on advancing the platform's capabilities.

Key Features of Trino:

  1. Distributed Architecture:

    • Trino is designed to scale out across multiple nodes, enabling parallel execution of queries for high performance on large datasets.
  2. Multi-Source Querying:

    • It allows you to query data from multiple heterogeneous data sources, such as relational databases (MySQL, PostgreSQL), data lakes (AWS S3, HDFS), NoSQL stores, and more, without the need to move or replicate data.
  3. ANSI SQL Compliance:

    • Trino supports a broad range of SQL standards, making it suitable for complex analytical queries and seamless integration with tools that rely on SQL.
  4. Connector-Based Architecture:

    • Trino has connectors for various data sources. These connectors abstract the underlying data source details, allowing users to query them as if they were a single database (see the short example after this list).
  5. In-Memory Execution:

    • Queries are executed primarily in memory for fast response times, making Trino suitable for interactive analysis and business intelligence.
  6. Open Source:

    • Trino is an open-source project under the Apache License 2.0, with an active community of contributors and users.
  7. Pluggable Extensions:

    • Users can extend Trino’s capabilities by developing custom connectors or using built-in extensions.
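
For example, once a few catalogs are configured, every source is addressed with the same catalog.schema.table naming and plain SQL. The catalog, schema, and table names below are purely illustrative:

-- List the configured catalogs (one per data source)
SHOW CATALOGS;

-- Query a table in a hypothetical mysql_sales catalog as if it were a local table
SELECT order_id, order_total
FROM mysql_sales.shop.orders
LIMIT 10;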

Common Use Cases:

  1. Data Lake Querying:

    • Analyze data stored in distributed object stores like Amazon S3, Google Cloud Storage, or HDFS.
  2. ETL and Data Transformation:

    • Perform lightweight ETL tasks and transform data across disparate data sources.
  3. Data Virtualization:

    • Query multiple data sources as if they were a single database, enabling unified analytics without moving data.
  4. Business Intelligence and Analytics:

    • Integrates with tools like Tableau, Power BI, and Looker for fast, interactive querying of large datasets.
  5. Ad-Hoc Querying:

    • Quickly execute queries on large-scale datasets for data exploration and decision-making.

Popular Data Sources Trino Supports:

  • Relational Databases: MySQL, PostgreSQL, SQL Server, Oracle
  • Data Lakes: Amazon S3, Azure Blob Storage, HDFS
  • Table Formats and Warehouses: Apache Hive, Apache Iceberg, Delta Lake
  • NoSQL Databases: Apache Cassandra, MongoDB, Elasticsearch
  • Specialized Systems: Kafka, Redis, Google BigQuery

Trino is widely adopted in modern data engineering and analytics ecosystems.

Big Data Ecosystem

If you have worked on the Hadoop big data platform, you may be familiar with Hive and Sqoop, with data distributed across HDFS (Hadoop Distributed File System). When Hadoop first arrived, you had to write MapReduce code in Java. Because DBAs are most familiar with SQL, Hive and Sqoop emerged so that data could be processed with queries written in a SQL-like language. Hive internally converts a SQL query into Java MapReduce jobs, and it provides SQL-like joins, custom functions, and so on. Sqoop was used for ETL-style work such as exporting data from an RDBMS into HDFS, or vice versa.

Feature | Hive | Trino
Type | Data warehouse infrastructure built on top of Hadoop. | Distributed SQL query engine for querying diverse data sources.
Purpose | Batch processing and ETL. | Interactive, low-latency analytics and data virtualization.
Origin | Developed by Facebook for large-scale batch processing with Hadoop. | Fork of Presto, originally developed by Facebook, now maintained as Trino.

Performance


Feature | Hive | Trino
Latency | High latency; designed for batch processing, not interactive queries. | Low latency; optimized for fast, interactive querying.
Execution Engine | Uses MapReduce (deprecated), Tez, or Spark for query execution. | Executes queries in-memory, using a distributed architecture.
Optimization | Supports query optimization with ORC/Parquet formats and CBO (Cost-Based Optimization). | Uses advanced query optimizations and parallelism for high performance.

Data Sources


Feature | Hive | Trino
Data Source Support | Designed primarily for Hadoop-based storage (HDFS, S3, etc.). | Supports a wide range of data sources including HDFS, RDBMS, NoSQL, and more.
Connectors | Limited connectors outside of the Hadoop ecosystem. | Extensive set of connectors for relational databases, NoSQL stores, object stores, etc.

A Trino SQL query is blazing fast compared to the same query in Hive. Trino runs queries in memory (with an option to spill to disk), so the same query typically finishes far faster in Trino than in Hive.

Trino has connectors for many RDBMSs, Hive, Delta Lake, and many other data sources, and you can write a single Trino query that joins tables from Oracle, MySQL, S3, and so on. You can join an Oracle table with a MySQL table and a PostgreSQL table using simple catalog definitions and standard SQL syntax.
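
As a rough sketch of such a federated query (the catalog, schema, and table names are made up for illustration and assume an oracle_test catalog, a mysql_test catalog, and an S3-backed hive catalog have been configured):

-- Join an Oracle table, a MySQL table, and data on S3 in a single query
SELECT o.customer_id,
       c.customer_name,
       s.total_spend
FROM oracle_test.sales.orders o
JOIN mysql_test.crm.customers c ON o.customer_id = c.customer_id
JOIN hive.analytics.customer_spend s ON o.customer_id = s.customer_id;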

Trino Architecture

Trino has mainly two types of nodes:
  • Coordinator (1)
  • Workers (1 to many)
For testing purposes it is possible to install and configure the coordinator and a worker on the same node.

1. Coordinator

  • The Coordinator is the central brain of the Trino cluster.
  • Responsibilities:
    • Parsing and Planning:
      • Parses SQL queries and generates an optimized execution plan.
    • Task Scheduling:
      • Assigns tasks to worker nodes and monitors their progress.
    • Metadata Management:
      • Interfaces with the Hive Metastore (or other metadata services) to fetch schema and table information.
    • Query Optimization:
      • Performs cost-based optimizations and determines the most efficient query execution plan.
  • Characteristics:
    • Single instance in a cluster.
    • Does not process data itself but orchestrates the execution of queries (see the EXPLAIN example after this section).

2. Workers

  • Workers are the nodes that execute the actual query tasks.
  • Responsibilities:
    • Process data by reading from data sources, filtering, aggregating, and joining datasets.
    • Perform query execution in parallel.
    • Exchange intermediate results with other workers as required (e.g., during a distributed join or aggregation).
  • Characteristics:
    • Highly scalable; you can add more workers to handle larger workloads.
    • Communicate directly with data sources and with each other for intermediate results.
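
One simple way to see this division of labor is EXPLAIN: the coordinator produces a distributed plan made of stages that are then shipped to the workers. A minimal sketch, assuming a mysql_test catalog with a customers table (illustrative names, like the catalog created later in this post):

-- Show the distributed plan the coordinator will hand out to workers
EXPLAIN (TYPE DISTRIBUTED)
SELECT region, count(*)
FROM mysql_test.crm.customers
GROUP BY region;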

Hive Metastore or AWS Glue Catalog

Trino does not strictly depend on the Hive Metastore, but it often leverages it for metadata management, especially when querying data in systems like Hadoop, HDFS, or data lakes using file formats such as ORC and Parquet.


When Trino Uses the Hive Metastore

  • Metadata Management:
    • The Hive Metastore provides schema information (e.g., table definitions, column types, and partitioning details) for datasets stored in data lakes (HDFS, S3, etc.).
  • Query Optimization:
    • Trino uses metadata from the Hive Metastore for optimizations like pruning partitions or determining file locations.

You do need a metastore-backed catalog in some cases, e.g. if you want to read data from an RDBMS and then write the result to S3.
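
As a rough illustration of why the metastore matters (schema, table, and column names below are hypothetical, and a metastore-backed hive catalog like the one set up later in this post is assumed): the metastore holds the table and partition definitions, so a filter on the partition column lets Trino prune which S3/HDFS paths it actually reads.

-- A partitioned table in a hive catalog; its definition lives in the metastore
CREATE TABLE hive.my_schema.events (
    event_id   bigint,
    event_type varchar,
    event_date date
)
WITH (
    format = 'PARQUET',
    partitioned_by = ARRAY['event_date']
);

-- The filter on the partition column lets Trino skip all other partitions
SELECT count(*)
FROM hive.my_schema.events
WHERE event_date = DATE '2024-12-01';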

Please read this blog for the steps to install the Hive Standalone Metastore and Trino. It is a fairly complex topic and involves many steps.
The blog article lays out the steps; although it is a few years old, the concepts are still the same.
The latest Trino version, 467, requires JDK 23, so modify the steps to install JDK 23.


Trino Installation

Step 1: Prerequisites

Ensure Java 23 is installed. Check whether Java is installed:

Trino version 467 requires JDK-23

java -version

  1. If it's not installed, do the following:

Ubuntu:

If openjdk-23 is not available in your default repository:


sudo add-apt-repository ppa:openjdk-r/ppa

sudo apt update

sudo apt install openjdk-23-jdk


CentOS/RHEL:

sudo yum install java-23-openjdk


  2. Memory Requirements: Make sure your system has at least 4 GB of RAM.


Step 2: Download Trino 467 (or whatever latest version is available)

Download the Trino 467 tarball:

Create trino user

sudo adduser trino

sudo su - trino

Assuming your home folder is /home/trino


wget https://repo1.maven.org/maven2/io/trino/trino-server/467/trino-server-467.tar.gz



Step 3: Extract the Archive

Extract the downloaded tarball:


tar -xvzf trino-server-467.tar.gz

## You can create a soft link trino-server which points to trino-server-467

ln -s trino-server-467 trino-server

cd trino-server-467    ## or: cd trino-server



Step 4: Configure Trino

The etc folder keeps all the configuration files. You can create the etc directory inside the install, or keep the config files outside of the trino-server install so that when you upgrade the Trino version you don't have to copy config files and can instead just create a soft link. To keep things simple at this point, let's create the etc folder inside trino-server-467:

mkdir etc


mkdir -p /home/trino/data ## it will be used later


  1. Add the following configuration files:

node.properties (defines the node settings):


## node.id will change from one node to another and should be unique.

## You can install the uuid package to generate a unique id, e.g. sudo apt install -y uuid

## node.data-dir: the location (filesystem path) of the data directory. Trino stores logs and other data here.



node.environment=test

node.id=unique-node-id

node.data-dir=/home/trino/data


jvm.config (configures JVM options):
## Change Xmx based on your memory; usually set the JVM heap to about 70% of the node's memory


-verbose:class

-server

-Xmx4G

-XX:+UseG1GC

-XX:G1HeapRegionSize=32M

-XX:+UseGCOverheadLimit

-XX:+ExplicitGCInvokesConcurrent

-XX:+HeapDumpOnOutOfMemoryError

-XX:+ExitOnOutOfMemoryError

-Djdk.attach.allowAttachSelf=true

-Dcom.sun.management.jmxremote.rmi.port=9081




config.properties (basic Trino server settings):


The following is a minimal configuration for the coordinator:

coordinator=true

node-scheduler.include-coordinator=false

http-server.http.port=8080

discovery.uri=http://example.net:8080

And this is a minimal configuration for the workers:

coordinator=false

http-server.http.port=8080

discovery.uri=http://your-trino-host:8080

Alternatively, if you are setting up a single machine for testing, that functions as both a coordinator and worker, use this configuration:

coordinator=true

node-scheduler.include-coordinator=true

http-server.http.port=8080

discovery.uri=http://your-trino-host:8080



Add log.properties with this content

io.trino=INFO

This is what a basic etc folder looks like. Also create a catalog folder inside it; this folder will hold all connector properties files.

mkdir -p etc/catalog

etc$ ls -ltr

total 44

-rw-r--r-- 1 trino trino   14 Nov 27 05:56 log.properties

-rwxr-xr-x 1 trino trino   85 Dec 15 06:23 trino-env.sh

-rw-r--r-- 1 trino trino  264 Dec 16 03:33 jvm.config

drwxr-xr-x 2 trino trino 4096 Dec 16 03:43 catalog

-rw-r--r-- 1 trino trino  185 Dec 17 20:38 config.properties

-rw-r--r-- 1 trino trino  189 Dec 21 06:40 node.properties

cat trino-env.sh 

export JAVA_HOME=/usr/lib/jvm/java-23-openjdk-amd64

export PATH=$JAVA_HOME/bin:$PATH

Step 5: Start Trino

Start the server using the included launcher script:

bin/launcher start


To check the status:

bin/launcher status


To stop the server:

bin/launcher stop


Log files are located at /home/trino/data/var/log/server.log, where /home/trino/data is the path you set in the node.data-dir property in node.properties.



Step 6: Verify Installation

  1. Open your web browser and navigate to: http://trino-host:8080

  2. Use the Trino Web UI to monitor the Trino server. It shows all nodes, queries being executed or finished, etc.

Your log file will display SERVER STARTED:

2024-12-21T06:45:11.169Z INFO main io.trino.eventlistener.EventListenerManager -- Loaded event listener /home/trino/data/etc/mysql-event-listener.properties --
2024-12-21T06:45:11.249Z INFO main io.trino.server.Server Server startup completed in 18.22s
2024-12-21T06:45:11.249Z INFO main io.trino.server.Server ======== SERVER STARTED ========
  

Download Trino Command Line Client

e.g.
wget https://repo1.maven.org/maven2/io/trino/trino-cli/467/trino-cli-467-executable.jar

After downloading, you can rename the file trino-cli-467-executable.jar to trino and copy or move it to /usr/local/bin
e.g. mv trino-cli-467-executable.jar /usr/local/bin/trino
chmod 755 /usr/local/bin/trino

To connect to the Trino server from the client:
If you are running on the same host where you installed the Trino server,
you can simply do

trino
trino> show catalogs ;

If you are running the client from any other host:
trino --server http://your-trino-host:8080
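
Once connected, a quick way to confirm that the coordinator and all workers have registered is to query Trino's built-in system catalog:

trino> select node_id, http_uri, node_version, coordinator, state from system.runtime.nodes ;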


Create Catalog

Create Catalog for MySQL Database


CREATE CATALOG mysql_test USING mysql

WITH (

  "connection-url" = 'jdbc:mysql://my-sql-end-point:3306',

  "connection-user" = 'mysql_user',

  "connection-password" = 'mysql_password'

);


## Now SHOW CATALOGS will include the catalog named mysql_test
##
trino> show schemas from mysql_test ;
## This shows all schemas/databases in the catalog mysql_test
## Show all tables from catalog mysql_test and schema schema_name
## (schema name and database name are used interchangeably)
trino> show tables from mysql_test.schema_name ;
## Select from one table, showing only 10 rows on screen
trino> select * from mysql_test.schema_name.table_name limit 10 ;

## Trino uses catalog_name.schema_name.table_name syntax

Create Catalog for SQL Server Database

CREATE CATALOG sqlserver_test USING sqlserver
WITH (
  "connection-url" = 'jdbc:sqlserver://sql-server-host:1433;databaseName=db_name;encrypt=false',
  "connection-user" = 'db_user',
  "connection-password" = 'db_password'
);

You can join tables from different catalogs, from the same catalog, or from the same catalog but different databases, etc.


Join tables from two different catalogs

select ...
from mysql_test.db_name.table_name as A, sqlserver_test.db_name.table_name as B
where A.column = B.column

Storing Resultset to S3

## Before we can do that, we need a metastore to store table metadata; for that you can either create a Hive Standalone Metastore or use AWS Glue

## Assuming you have created the Hive metastore and it is configured to use S3

## See the doc Querying S3 Object Stores with Trino for configuring the Hive Metastore and S3 setup

CREATE SCHEMA hive.my_schema WITH (location = 's3://s3_bucket/data/');

We are all familiar with the CTAS (Create Table As Select) method,
e.g.
Create Table S3Table As select ..... ;
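
A minimal end-to-end sketch of the same CTAS pattern, assuming the hive.my_schema schema created above and a hypothetical source table in the mysql_test catalog:

-- Read from MySQL and write the result set to S3 as a new Hive-catalog table
CREATE TABLE hive.my_schema.orders_copy
WITH (format = 'PARQUET')
AS
SELECT order_id, customer_id, order_total
FROM mysql_test.shop.orders;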

