Trino is a distributed SQL query engine designed for fast, interactive analytical queries against a wide variety of data sources. Originally developed at Facebook as Presto, it was later forked by its original creators into an independent project focused on advancing the platform's capabilities.
Key Features of Trino:
Distributed Architecture:
- Trino is designed to scale out across multiple nodes, enabling parallel execution of queries for high performance on large datasets.
Multi-Source Querying:
- It allows you to query data from multiple heterogeneous data sources, such as relational databases (MySQL, PostgreSQL), data lakes (AWS S3, HDFS), NoSQL stores, and more, without the need to move or replicate data.
ANSI SQL Compliance:
- Trino supports a broad range of SQL standards, making it suitable for complex analytical queries and seamless integration with tools that rely on SQL.
Connector-Based Architecture:
- Trino has connectors for various data sources. These connectors abstract the underlying data source details, allowing users to query them as if they were a single database.
In-Memory Execution:
- Queries are executed in memory, with data streamed between stages rather than written to disk as intermediate results, giving fast response times suitable for interactive analysis and business intelligence.
Open Source:
- Trino is an open-source project under the Apache License 2.0, with an active community of contributors and users.
Pluggable Extensions:
- Users can extend Trino’s capabilities by developing custom connectors or using built-in extensions.
Common Use Cases:
Data Lake Querying:
- Analyze data stored in distributed object stores like Amazon S3, Google Cloud Storage, or HDFS.
ETL and Data Transformation:
- Perform lightweight ETL tasks and transform data across disparate data sources.
Data Virtualization:
- Query multiple data sources as if they were a single database, enabling unified analytics without moving data.
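For example, a single Trino query can join tables that live in entirely different systems. A sketch assuming hypothetical mysql_test and hive catalogs with matching customer keys:

```sql
-- Hypothetical catalogs/tables: orders in MySQL, clickstream events in a Hive data lake
SELECT o.order_id, o.customer_id, count(e.event_id) AS page_views
FROM mysql_test.shop.orders o
JOIN hive.web.events e
  ON o.customer_id = e.customer_id
GROUP BY o.order_id, o.customer_id;
```

Trino pushes work down to each source where possible and performs the join itself, so no data has to be copied between MySQL and the data lake beforehand.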
Business Intelligence and Analytics:
- Integrates with tools like Tableau, Power BI, and Looker for fast, interactive querying of large datasets.
Ad-Hoc Querying:
- Quickly execute queries on large-scale datasets for data exploration and decision-making.
Popular Data Sources Trino Supports:
- Relational Databases: MySQL, PostgreSQL, SQL Server, Oracle
- Data Lakes: Amazon S3, Azure Blob Storage, HDFS
- Table Formats: Apache Hive, Apache Iceberg, Delta Lake
- NoSQL Databases: Apache Cassandra, MongoDB, Elasticsearch
- Specialized Systems: Kafka, Redis, Google BigQuery
Trino is widely adopted in modern data engineering and analytics ecosystems.
Trino Architecture
- Coordinator (1)
- Workers (1 to many)
1. Coordinator
- The Coordinator is the central brain of the Trino cluster.
- Responsibilities:
- Parsing and Planning:
- Parses SQL queries and generates an optimized execution plan.
- Task Scheduling:
- Assigns tasks to worker nodes and monitors their progress.
- Metadata Management:
- Interfaces with the Hive Metastore (or other metadata services) to fetch schema and table information.
- Query Optimization:
- Performs cost-based optimizations and determines the most efficient query execution plan.
- Characteristics:
- Single instance in a cluster.
- Does not process data but orchestrates the execution of queries.
2. Workers
- Workers are the nodes that execute the actual query tasks.
- Responsibilities:
- Process data by reading from data sources, filtering, aggregating, and joining datasets.
- Perform query execution in parallel.
- Exchange intermediate results with other workers as required (e.g., during a distributed join or aggregation).
- Characteristics:
- Highly scalable; you can add more workers to handle larger workloads.
- Communicate directly with data sources and with each other for intermediate results.
Hive Metastore or AWS Glue Catalog
Trino does not strictly depend on the Hive Metastore, but it often leverages it (or a compatible service such as the AWS Glue Data Catalog) for metadata management, especially when querying data in systems like Hadoop, HDFS, or data lakes using file formats such as ORC and Parquet.
When Trino Uses the Hive Metastore
- Metadata Management:
- The Hive Metastore provides schema information (e.g., table definitions, column types, and partitioning details) for datasets stored in data lakes (HDFS, S3, etc.).
- Query Optimization:
- Trino uses metadata from the Hive Metastore for optimizations like pruning partitions or determining file locations.
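For instance, if a table in the data lake is partitioned by date, a predicate on the partition column lets Trino consult the metastore and scan only the matching partitions. A sketch with hypothetical catalog and table names:

```sql
-- Assuming hive.web.events is partitioned by event_date;
-- only files belonging to the 2024-01-01 partition are read.
SELECT count(*)
FROM hive.web.events
WHERE event_date = DATE '2024-01-01';
```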
Trino Installation
Step 1: Prerequisites
Ensure Java 23 is installed; Trino 467 requires JDK 23. Check the installed version:
java -version
If it's not installed, do the following:
Ubuntu:
If openjdk-23 is not available in your default repositories, add the PPA first:
sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt update
sudo apt install openjdk-23-jdk
CentOS/RHEL:
sudo yum install java-23-openjdk
Memory requirements: make sure your system has at least 4 GB of RAM.
Step 2: Download Trino 467 (or the latest available version)
Download the Trino 467 tarball:
Create trino user
sudo adduser trino
sudo su - trino
Assuming your home folder is /home/trino
wget https://repo1.maven.org/maven2/io/trino/trino-server/467/trino-server-467.tar.gz
Step 3: Extract the Archive
Extract the downloaded tarball:
tar -xvzf trino-server-467.tar.gz
## You can create a soft link trino-server that points to trino-server-467,
## so future upgrades only require repointing the link.
ln -s trino-server-467 trino-server
cd trino-server
Step 4: Configure Trino
The etc folder holds all configuration files. You can create etc inside the install directory, or keep the config files outside of the trino-server install so that upgrading Trino doesn't require copying them (just recreate the soft link). To keep things simple at this point, let's create the etc folder inside trino-server-467.
mkdir etc
mkdir -p /home/trino/data ## it will be used later
Add the following configuration files:
node.properties (defines the node settings):
## node.id must be unique and will differ from one node to another.
## You can install the uuid package to generate one, e.g. sudo apt install -y uuid
## node.data-dir: The location (filesystem path) of the data directory. Trino stores logs and other data here
node.environment=test
node.id=unique-node-id
node.data-dir=/home/trino/data
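Since node.id must differ per host, it is easiest to generate it rather than type it by hand. A minimal sketch using the kernel's UUID source on Linux (writing to /tmp here purely for illustration; the uuid package mentioned above works just as well):

```shell
# node.id must be unique per node; /proc/sys/kernel/random/uuid yields a fresh UUID on Linux
mkdir -p /tmp/trino-demo
cat > /tmp/trino-demo/node.properties <<EOF
node.environment=test
node.id=$(cat /proc/sys/kernel/random/uuid)
node.data-dir=/home/trino/data
EOF
cat /tmp/trino-demo/node.properties
```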
jvm.config (configures JVM options):
## Set Xmx based on available memory; a common rule of thumb is to give the JVM about 70% of physical memory.
-verbose:class
-server
-Xmx4G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true
-Dcom.sun.management.jmxremote.rmi.port=9081
config.properties (basic Trino server settings):
The following is a minimal configuration for the coordinator:
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery.uri=http://your-trino-host:8080
And this is a minimal configuration for the workers:
coordinator=false
http-server.http.port=8080
discovery.uri=http://your-trino-host:8080
Alternatively, if you are setting up a single machine for testing that functions as both a coordinator and worker, use this configuration:
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
discovery.uri=http://your-trino-host:8080
Add log.properties with this content:
io.trino=INFO
This is what a basic etc folder looks like. Also create a catalog folder inside it; this will hold all connector properties files.
mkdir -p etc/catalog
etc$ ls -ltr
total 44
-rw-r--r-- 1 trino trino 14 Nov 27 05:56 log.properties
-rwxr-xr-x 1 trino trino 85 Dec 15 06:23 trino-env.sh
-rw-r--r-- 1 trino trino 264 Dec 16 03:33 jvm.config
drwxr-xr-x 2 trino trino 4096 Dec 16 03:43 catalog
-rw-r--r-- 1 trino trino 185 Dec 17 20:38 config.properties
-rw-r--r-- 1 trino trino 189 Dec 21 06:40 node.properties
The trino-env.sh file sets JAVA_HOME so the launcher picks up the correct JDK:
cat trino-env.sh
export JAVA_HOME=/usr/lib/jvm/java-23-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH
Step 5: Start Trino
Start the server using the included launcher script:
bin/launcher start
To check the status:
bin/launcher status
To stop the server:
bin/launcher stop
Step 6: Verify Installation
Open your web browser and navigate to: http://your-trino-host:8080
Use the Trino Web UI to monitor the server. It shows all active nodes and the queries that are running or finished.
Download Trino Command Line Client
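The CLI is published as a self-executing jar alongside the server releases on Maven Central. A sketch for version 467 (the jar version should match your server; the --server host below is a placeholder):

```shell
# Download the Trino 467 CLI and make it executable
wget -q https://repo1.maven.org/maven2/io/trino/trino-cli/467/trino-cli-467-executable.jar -O trino
chmod +x trino
# Connect to the coordinator (placeholder host):
# ./trino --server http://your-trino-host:8080
```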
Create Catalog
Create Catalog for MySQL Database (this SQL statement requires dynamic catalog management, i.e. catalog.management=dynamic in config.properties):
CREATE CATALOG mysql_test USING mysql
WITH (
"connection-url" = 'jdbc:mysql://my-sql-end-point:3306',
"connection-user" = 'mysql_user',
"connection-password" = 'mysql_password'
);
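If dynamic catalog management is not enabled, the conventional alternative is a properties file in etc/catalog, which Trino reads at server startup. A sketch using the same hypothetical connection details:

```properties
# etc/catalog/mysql_test.properties -- the catalog name comes from the file name
connector.name=mysql
connection-url=jdbc:mysql://my-sql-end-point:3306
connection-user=mysql_user
connection-password=mysql_password
```

Restart the server (bin/launcher restart) after adding or changing catalog files.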