Friday, December 20, 2024

What is Trino? Basic Trino Installation and Setup

 Trino is a distributed SQL query engine designed for running fast and interactive analytical queries against various data sources. Originally developed at Facebook as Presto, Trino was later forked to create an independent project focused on advancing the platform's capabilities.

Key Features of Trino:

  1. Distributed Architecture:

    • Trino is designed to scale out across multiple nodes, enabling parallel execution of queries for high performance on large datasets.
  2. Multi-Source Querying:

    • It allows you to query data from multiple heterogeneous data sources, such as relational databases (MySQL, PostgreSQL), data lakes (AWS S3, HDFS), NoSQL stores, and more, without the need to move or replicate data.
  3. ANSI SQL Compliance:

    • Trino supports a broad range of SQL standards, making it suitable for complex analytical queries and seamless integration with tools that rely on SQL.
  4. Connector-Based Architecture:

    • Trino has connectors for various data sources. These connectors abstract the underlying data source details, allowing users to query them as if they were a single database.
  5. In-Memory Execution:

    • Queries are executed primarily in memory for fast response times, making Trino suitable for interactive analysis and business intelligence.
  6. Open Source:

    • Trino is an open-source project under the Apache License 2.0, with an active community of contributors and users.
  7. Pluggable Extensions:

    • Users can extend Trino’s capabilities by developing custom connectors or using built-in extensions.

Common Use Cases:

  1. Data Lake Querying:

    • Analyze data stored in distributed object stores like Amazon S3, Google Cloud Storage, or HDFS.
  2. ETL and Data Transformation:

    • Perform lightweight ETL tasks and transform data across disparate data sources.
  3. Data Virtualization:

    • Query multiple data sources as if they were a single database, enabling unified analytics without moving data.
  4. Business Intelligence and Analytics:

    • Integrates with tools like Tableau, Power BI, and Looker for fast, interactive querying of large datasets.
  5. Ad-Hoc Querying:

    • Quickly execute queries on large-scale datasets for data exploration and decision-making.

Popular Data Sources Trino Supports:

  • Relational Databases: MySQL, PostgreSQL, SQL Server, Oracle
  • Data Lakes: Amazon S3, Azure Blob Storage, HDFS
  • Table Formats: Apache Hive, Apache Iceberg, Delta Lake
  • NoSQL Databases: Apache Cassandra, MongoDB, Elasticsearch
  • Specialized Systems: Kafka, Redis, Google BigQuery

Trino is widely adopted in modern data engineering and analytics ecosystems.

Big Data Ecosystem

If you have worked on the Hadoop big data platform, you might be familiar with Hive and Sqoop, with data distributed in HDFS (Hadoop Distributed File System). When Hadoop first came out, you had to write MapReduce code in Java. Since DBAs are most familiar with SQL, Hive and Sqoop were created so that data could be processed with SQL-like queries. Hive internally converts a SQL query into Java MapReduce jobs and provides SQL-like joins, custom functions, and more. Sqoop was used for ETL-style tasks such as exporting data from an RDBMS into HDFS, or vice versa.

Feature | Hive | Trino
Type | Data warehouse infrastructure built on top of Hadoop. | Distributed SQL query engine for querying diverse data sources.
Purpose | Batch processing and ETL. | Interactive, low-latency analytics and data virtualization.
Origin | Developed by Facebook for large-scale batch processing with Hadoop. | Fork of Presto, originally developed by Facebook, now maintained as Trino.

Performance


Feature | Hive | Trino
Latency | High latency; designed for batch processing, not interactive queries. | Low latency; optimized for fast, interactive querying.
Execution Engine | Uses MapReduce (deprecated), Tez, or Spark for query execution. | Executes queries in-memory, using a distributed architecture.
Optimization | Supports query optimization with ORC/Parquet formats and CBO (Cost-Based Optimization). | Uses advanced query optimizations and parallelism for high performance.

Data Sources


Feature | Hive | Trino
Data Source Support | Designed primarily for Hadoop-based storage (HDFS, S3, etc.). | Supports a wide range of data sources including HDFS, RDBMS, NoSQL, and more.
Connectors | Limited connectors outside of the Hadoop ecosystem. | Extensive set of connectors for relational databases, NoSQL stores, object stores, etc.

A Trino SQL query is blazing fast compared to the same query in Hive. Trino runs queries in memory (with an option to spill to disk), so the same query will typically finish much faster in Trino than in Hive.

Trino has connectors for many RDBMSs, Hive, Delta Lake, and many other data sources, and you can write a single Trino query that joins tables across Oracle, MySQL, S3, and so on. For example, you can join an Oracle table with a MySQL table and a PostgreSQL table using simple catalog names and standard SQL syntax, as shown in the sketch below.
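
A minimal sketch of such a federated query, assuming catalogs named oracle_cat, mysql_cat, and postgres_cat have already been configured (all schema and table names here are hypothetical):

SELECT o.order_id, c.customer_name, p.product_name
FROM oracle_cat.sales.orders o
JOIN mysql_cat.crm.customers c ON o.customer_id = c.customer_id
JOIN postgres_cat.inventory.products p ON o.product_id = p.product_id
LIMIT 10;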

Trino Architecture

Trino has two main types of nodes:
  • Coordinator (1)
  • Workers (1 to many)
It is also possible to install/configure the coordinator and a worker on the same node for testing purposes.

1. Coordinator

  • The Coordinator is the central brain of the Trino cluster.
  • Responsibilities:
    • Parsing and Planning:
      • Parses SQL queries and generates an optimized execution plan.
    • Task Scheduling:
      • Assigns tasks to worker nodes and monitors their progress.
    • Metadata Management:
      • Interfaces with the Hive Metastore (or other metadata services) to fetch schema and table information.
    • Query Optimization:
      • Performs cost-based optimizations and determines the most efficient query execution plan.
  • Characteristics:
    • Single instance in a cluster.
    • Does not process data but orchestrates the execution of queries.

2. Workers

  • Workers are the nodes that execute the actual query tasks.
  • Responsibilities:
    • Process data by reading from data sources, filtering, aggregating, and joining datasets.
    • Perform query execution in parallel.
    • Exchange intermediate results with other workers as required (e.g., during a distributed join or aggregation).
  • Characteristics:
    • Highly scalable; you can add more workers to handle larger workloads.
    • Communicate directly with data sources and with each other for intermediate results.

Hive Metastore or AWS Glue Catalog

Trino does not strictly depend on the Hive Metastore, but it often leverages it for metadata management, especially when querying data in systems like Hadoop, HDFS, or data lakes using file formats such as ORC and Parquet.


When Trino Uses the Hive Metastore

  • Metadata Management:
    • The Hive Metastore provides schema information (e.g., table definitions, column types, and partitioning details) for datasets stored in data lakes (HDFS, S3, etc.).
  • Query Optimization:
    • Trino uses metadata from the Hive Metastore for optimizations like pruning partitions or determining file locations.

You do need a metastore-backed catalog (Hive Metastore or AWS Glue), for example if you want to read data from an RDBMS and then write it to S3.

Please read this blog for the steps to install the Hive Standalone Metastore and Trino. It is a fairly complex topic and involves many steps.
That blog article lays down the steps; although it is a few years old, the concepts are still the same.
The latest Trino version, 467, requires JDK 23, so modify the steps to install JDK 23.


Trino Installation

Step 1: Prerequisites

Ensure Java 23 is installed. Trino version 467 requires JDK 23. Check whether Java is installed:

java -version

  1. If it's not installed, do the following:

Ubuntu:

If openjdk-23 is not available in your default repository, add the OpenJDK PPA first:


sudo add-apt-repository ppa:openjdk-r/ppa

sudo apt update

sudo apt install openjdk-23-jdk


CentOS/RHEL:

sudo yum install java-23-openjdk


  2. Memory requirements: make sure your system has at least 4GB of RAM.


Step 2: Download Trino 467 (or whatever the latest available version is)

First create a trino OS user:

sudo adduser trino

sudo su - trino

Assuming your home folder is /home/trino, download the Trino 467 tarball:


wget https://repo1.maven.org/maven2/io/trino/trino-server/467/trino-server-467.tar.gz
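
Optionally, you can verify the download against the SHA-1 checksum that Maven Central publishes alongside each artifact (appending .sha1 to the artifact URL is Maven Central's convention; treat this as an extra sanity check rather than part of the official Trino instructions):

wget https://repo1.maven.org/maven2/io/trino/trino-server/467/trino-server-467.tar.gz.sha1
sha1sum trino-server-467.tar.gz   ## the hash printed here should match the contents of the .sha1 file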



Step 3: Extract the Archive

Extract the downloaded tarball:


tar -xvzf trino-server-467.tar.gz

## You can create a soft link trino-server that points to trino-server-467

ln -s trino-server-467 trino-server

cd trino-server   ## or cd trino-server-467



Step 4: Configure Trino

The etc folder keeps all the configuration files. You can create the etc directory inside the install, or keep the config files outside of the trino-server install so that when you upgrade the Trino version you don't have to copy the config files but can instead just re-create the soft link. To keep things simple at this point, let's create the etc folder inside trino-server-467:

mkdir etc


mkdir -p /home/trino/data   ## this directory will be used later as node.data-dir


  1. Add the following configuration files:

node.properties (defines the node settings):


## node.id should be unique and will differ from one node to another.

## You can install the uuid package to generate a unique id, e.g. sudo apt install -y uuid

## node.data-dir: the location (filesystem path) of the data directory. Trino stores logs and other data here.



node.environment=test

node.id=unique-node-id

node.data-dir=/home/trino/data


jvm.config (configures JVM options):
## Adjust Xmx based on your memory; a common rule of thumb is to give the JVM about 70% of available memory


-verbose:class

-server

-Xmx4G

-XX:+UseG1GC

-XX:G1HeapRegionSize=32M

-XX:+UseGCOverheadLimit

-XX:+ExplicitGCInvokesConcurrent

-XX:+HeapDumpOnOutOfMemoryError

-XX:+ExitOnOutOfMemoryError

-Djdk.attach.allowAttachSelf=true

-Dcom.sun.management.jmxremote.rmi.port=9081




config.properties (basic Trino server settings):


The following is a minimal configuration for the coordinator:

coordinator=true

node-scheduler.include-coordinator=false

http-server.http.port=8080

discovery.uri=http://your-trino-host:8080

And this is a minimal configuration for the workers:

coordinator=false

http-server.http.port=8080

discovery.uri=http://your-trino-host:8080

Alternatively, if you are setting up a single machine for testing, that functions as both a coordinator and worker, use this configuration:

coordinator=true

node-scheduler.include-coordinator=true

http-server.http.port=8080

discovery.uri=http://your-trino-host:8080



Add log.properties with this content

io.trino=INFO

This is what the basic etc folder looks like. Also create a catalog folder inside it; this folder will hold all the connector properties files.

mkdir -p etc/catalog
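
As an alternative to the CREATE CATALOG statement shown later in this post, a catalog can also be defined as a properties file in etc/catalog. A minimal sketch for a MySQL catalog (the host, user, and password below are placeholders you would replace):

## etc/catalog/mysql_test.properties
connector.name=mysql
connection-url=jdbc:mysql://my-sql-end-point:3306
connection-user=mysql_user
connection-password=mysql_password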

etc$ ls -ltr

total 44

-rw-r--r-- 1 trino trino   14 Nov 27 05:56 log.properties

-rwxr-xr-x 1 trino trino   85 Dec 15 06:23 trino-env.sh

-rw-r--r-- 1 trino trino  264 Dec 16 03:33 jvm.config

drwxr-xr-x 2 trino trino 4096 Dec 16 03:43 catalog

-rw-r--r-- 1 trino trino  185 Dec 17 20:38 config.properties

-rw-r--r-- 1 trino trino  189 Dec 21 06:40 node.properties

cat trino-env.sh 

export JAVA_HOME=/usr/lib/jvm/java-23-openjdk-amd64

export PATH=$JAVA_HOME/bin:$PATH

Step 5: Start Trino

Start the server using the included launcher script:

bin/launcher start


To check the status:

bin/launcher status


To stop the server:

bin/launcher stop
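
If the server fails to start, you can also run it in the foreground with the launcher so the logs print directly to your terminal:

bin/launcher run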


Log files are located under /home/trino/data/var/log/server.log, where /home/trino/data is the path you set in the node.data-dir property in node.properties.



Step 6: Verify Installation

  1. Open your web browser and navigate to: http://trino-host:8080

  2. Use the Trino Web UI to monitor the Trino server. It shows all nodes, queries being executed or finished, etc.

Your log file will display SERVER STARTED:

2024-12-21T06:45:11.169Z INFO main io.trino.eventlistener.EventListenerManager -- Loaded event listener /home/trino/data/etc/mysql-event-listener.properties --
2024-12-21T06:45:11.249Z INFO main io.trino.server.Server Server startup completed in 18.22s
2024-12-21T06:45:11.249Z INFO main io.trino.server.Server ======== SERVER STARTED ========
  

Download Trino Command Line Client

e.g.
wget https://repo1.maven.org/maven2/io/trino/trino-cli/467/trino-cli-467-executable.jar

After downloading, rename trino-cli-467-executable.jar to trino and copy or move it to /usr/local/bin, e.g.:
mv trino-cli-467-executable.jar /usr/local/bin/trino
chmod 755 /usr/local/bin/trino

To connect to the Trino server from the client:
If you are running the client on the same host where you installed the Trino server, you can simply do:

trino
trino> show catalogs;

If you are running the client from any other host:
trino --server http://your-trino-host:8080
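
You can also pass a default catalog and schema to the CLI so you don't have to fully qualify every table name (the catalog and schema names here are the example ones used below):

trino --server http://your-trino-host:8080 --catalog mysql_test --schema schema_name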


Create Catalog

Create Catalog for MySQL Database


CREATE CATALOG mysql_test USING mysql

WITH (

  "connection-url" = 'jdbc:mysql://my-sql-end-point:3306',

  "connection-user" = 'mysql_user',

  "connection-password" = 'mysql_password'

);


## Now 'show catalogs' will also list the catalog named mysql_test
##
trino> show schemas from mysql_test;
## It will show you all schemas/databases from the catalog mysql_test
## Show all tables from the catalog mysql_test and the database schema_name
## Schema name and database name are used interchangeably
trino> show tables from mysql_test.schema_name;
## Select from one table, showing only 10 rows on screen
trino> select * from mysql_test.schema_name.table_name limit 10;

## Trino uses catalog_name.schema_name.table_name syntax

Create Catalog for SQL Server Database

CREATE CATALOG sqlserver_test USING sqlserver
WITH (
  "connection-url" = 'jdbc:sqlserver://sql-server-host:1433;databaseName=db_name;encrypt=false',
  "connection-user" = 'db_user',
  "connection-password" = 'db_password'
);

You can join tables from different catalogs, from the same catalog, from the same catalog but different databases, and so on.


Join tables from two different catalogs:

select ...
from mysql_test.db_name.table_name as A, sqlserver_test.db_name.table_name as B
where A.column = B.column

Storing Resultset to S3

## Before we can do that, we need a metastore to store table metadata, so you either need to set up a Hive Standalone Metastore or use AWS Glue

## Assuming you have created a Hive metastore and it is configured to use S3

## Check this link in the doc, Querying S3 Object Stores with Trino, for configuring the Hive Metastore & S3 setup

CREATE SCHEMA hive.my_schema WITH (location = 's3://s3_bucket/data/');

You may be familiar with the CTAS (CREATE TABLE AS SELECT) method, e.g.:

Create Table S3Table As select ..... ;
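
A minimal sketch of a CTAS that writes a query result into the S3-backed schema created above, pulling from the MySQL catalog (the table names are hypothetical):

CREATE TABLE hive.my_schema.orders_copy AS
SELECT *
FROM mysql_test.schema_name.orders;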


Thursday, December 19, 2024

Apache Ranger, Solr & Trino

What is Apache Ranger

Apache Ranger is an open-source data security and governance framework designed for managing and enforcing policies for accessing and protecting sensitive data within a Hadoop ecosystem. It provides a centralized platform to define, administer, and audit access control policies across various Hadoop services, ensuring robust data governance and compliance.

Key Features of Apache Ranger:

  • Centralized Policy Management:
    • Allows administrators to define fine-grained access control policies for multiple Hadoop components such as HDFS, Hive, HBase, Kafka, and others.
  • Audit and Reporting:
    • Captures and logs all user access attempts, including successful and failed ones, and provides detailed audit trails.
    • Integration with tools like Solr, Elasticsearch, and Kibana for analyzing and visualizing audit logs.

Core Components of Apache Ranger:

  • Ranger Admin:
    • A centralized web interface to create, update, and manage security policies.
    • Administrators can define policies for various Hadoop services.
  • Ranger Plugins:
    • Lightweight components installed on individual Hadoop services (e.g., Hive, HDFS, HBase, Kafka, Trino/Presto).
    • Enforce security policies in real time by intercepting access requests and checking them against defined policies.
  • User Sync Service:
    • Synchronizes users and groups from external directories like LDAP or Active Directory into Apache Ranger.
  • Audit Framework:
    • Collects and stores detailed audit logs of access events, policy evaluations, and security violations.

 

Installing Apache Ranger

Apache Ranger Quick Install Guide 

On the host where you want to install Apache Ranger (Ubuntu 20.04 in this example):

You may have to install git, Java (JDK 8), mvn, and python3 if they are not already installed on your system.
Create two unix users, ranger and solr, e.g.:
sudo adduser ranger
sudo adduser solr
## su to the ranger account
## Download the Ranger repo and build it
su - ranger
mkdir ranger
cd ranger
git clone https://gitbox.apache.org/repos/asf/ranger.git
## set JAVA_HOME as per your location
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

mvn -Pall clean 
mvn -Pall -DskipTests=false clean compile package install
After the build command completes, you should see many tar.gz files in the target folder (and in the base folder as well).
In my case I see ranger-3.0.0-SNAPSHOT-admin.tar.gz; it could be a different version depending on the current latest version.
I have renamed ranger-3.0.0-SNAPSHOT-admin.tar.gz to ranger-3.0.0-admin.tar.gz.
Ranger Admin Tool Component (ranger-%version-number%-admin.tar.gz) should be 
installed on a host where Policy Admin Tool web application runs on port 6080 (default).
mkdir -p /usr/lib/ranger/
## copy ranger-3.0.0-admin.tar.gz to /usr/lib/ranger/
cp ranger-3.0.0-admin.tar.gz /usr/lib/ranger/

## Before we install Apache Ranger Admin, install Solr.
## In the base folder where you ran the mvn build, go to the folder
## security-admin/contrib/solr_for_audit_setup and edit the install.properties file
## (this is for the Solr install):
SOLR_USER=solr
SOLR_GROUP=solr
SOLR_DOWNLOAD_URL=http://archive.apache.org/dist/lucene/solr/8.9.0/solr-8.9.0.tgz
SOLR_INSTALL=true
## save & exit install.properties
## execute ./setup.sh to install Solr
./setup.sh
## The following step works around an error in Solr. The error is shown if you go to the
## Solr URL and look at the Dashboard:
## ranger_audits: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Error initializing QueryElevationComponent
cp -p /opt/solr/example/files/conf/elevate.xml /opt/solr/ranger_audit_server/ranger_audits/conf/
##
## Start Solr
/opt/solr/ranger_audit_server/scripts/start_solr.sh
## Test that the Solr URL is working
## http://host-name:6083/solr
## To stop Solr
/opt/solr/ranger_audit_server/scripts/stop_solr.sh
Solr Dashboard

## Install Apache Ranger Admin
## Database pre-req
## Before you can install Apache Ranger Admin, you either need an existing MySQL/Postgres
## database or you need to create a new one to hold the metadata for Apache Ranger.
## In my case, I have a MySQL DB running on the same host.
## On the MySQL database, create an empty database ranger and also create the user rangeradmin.
## MySQL version 8.0.x
## I was getting an issue with the default charset utf8mb4, so I chose utf8mb3 for the Ranger Admin
## install script to work.
CREATE DATABASE `ranger` DEFAULT CHARACTER SET utf8mb3 ;
CREATE USER `rangeradmin`@`%` IDENTIFIED WITH 'mysql_native_password' BY 'YourPassword' ;
GRANT ALL ON ranger.* TO `rangeradmin`@`%` ;
## Database pre-req end
cd /usr/lib/ranger/
tar -xvzf ranger-3.0.0-admin.tar.gz
## tar will create the ranger-3.0.0-admin folder under /usr/lib/ranger/
cd /usr/lib/ranger/ranger-3.0.0-admin
## You need to edit the install.properties file

## Change the following lines based on your settings
## You need a database
DB_FLAVOR=MYSQL
## MySQL JDBC connector; you may need to download this
## You can download the MySQL connector from here
## create a symlink 
## cd /usr/share/java
## ln -s /usr/share/java/mysql-connector-j-8.0.33.jar mysql-connector-java.jar
SQL_CONNECTOR_JAR=/usr/share/java/mysql-connector-java.jar

db_root_user=root
db_root_password=******
db_host=localhost
#
# DB UserId used for the Ranger schema
# We created the ranger database and rangeradmin user earlier in the pre-req
db_name=ranger
db_user=rangeradmin
db_password=****
##
# * audit_store is solr
audit_store=solr
audit_solr_urls=http://localhost:6083/solr/ranger_audits
audit_solr_user=solr
## save & exit the install.properties file

Execute ./setup.sh

In case setup.sh fails in the following step:

2024-12-16 20:23:56,315 [I] --------- Verifying Ranger DB connection ---------
2024-12-16 20:23:56,315 [I] Checking connection..
2024-12-16 20:23:56,315 [JISQL] /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/share/java/mysql-connector-java.jar:/usr/lib/ranger/ranger-3.0.0-admin/jisql/lib/* org.apache.util.sql.Jisql -driver mysqlconj -cstring jdbc:mysql://localhost/ranger?useSSL=false -u 'rangeradmin' -p '********' -noheader -trim -c \; -query "select 1;"
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
2024-12-16 20:23:56,756 [I] Checking connection failed.

This is due to the fact that allowPublicKeyRetrieval=true needs to be added, which is done by editing the db_setup.py file in the same folder. Edit db_setup.py and change (add allowPublicKeyRetrieval=true to) the following line:
if "useSSL" not in db_name: db_ssl_param="useSSL=false" To if "useSSL" not in db_name: db_ssl_param="?allowPublicKeyRetrieval=true&useSSL=false" ########## ### setup.sh will install ranger-admin, you will see /usr/bin/ranger-admin
### To start/stop Ranger Admin: /usr/bin/ranger-admin start|stop
## Check the Ranger Admin URL
## http://your-ranger-host:6080/
## The Ranger Admin default user is admin and the password is also admin
## You have successfully installed Ranger Admin :)

Enable the Trino Apache Ranger Plug-in
You will see Trino in the Apache Ranger Service Manager; click on the + sign to add Trino. You need to have your Trino Server JDBC URL handy.
## My existing Trino setup doesn't have any auth enabled
## We will enable auditing for any query run







Under the Ranger Admin Dashboard you will see Resource Policies; click there and you will see Trino as well.
You can edit policies here. For example, I have added extra users; by default only admin was there. I have added {USER} so any user can impersonate, e.g. in trino:
## Apache Ranger has users, groups, roles, etc., and it integrates with AD, LDAP, etc. as well
## Just to keep things simple, you can create users manually instead of having just admin
trino --user=sanjay
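
For a quick end-to-end check, you can run a one-off query as that user with the CLI's --execute option and then look for the corresponding entry on the Ranger Audit page:

trino --user=sanjay --execute 'SHOW CATALOGS'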





Trino Server-Side Configuration

With Trino version 466 or 467 (see this blog to learn more about Trino), Trino now ships a plug-in for Apache Ranger, and it is no longer dependent on the Apache Ranger Java version.

Configure access-control.properties on the Trino coordinator host:

cat access-control.properties


access-control.name=ranger
ranger.service.name=trino
ranger.plugin.config.resource=etc/ranger-trino-security.xml,etc/ranger-trino-audit.xml

####

cat ranger-trino-audit.xml


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration xmlns:xi="http://www.w3.org/2001/XInclude">
<property>
<name>xasecure.audit.is.enabled</name>
<value>true</value>
<description>Boolean flag to specify if the plugin should generate access audit logs. Default: true</description>
</property>


<property>
<name>xasecure.audit.solr.is.enabled</name>
<value>true</value>
<description>Boolean flag to specify if audit logs should be stored in Solr. Default: false</description>
</property>
<property>
<name>xasecure.audit.solr.solr_url</name>
<value>http://your-host-where-solr-installed:6083/solr/ranger_audits</value>
<description>URL to Solr deployment where the plugin should send access audits to</description>
</property>

</configuration>
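
The ranger-trino-security.xml file referenced in access-control.properties is not shown above. A minimal sketch, assuming the standard Ranger plugin property names (adjust the Ranger Admin URL for your host):

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration xmlns:xi="http://www.w3.org/2001/XInclude">
<property>
<name>ranger.plugin.trino.service.name</name>
<value>trino</value>
<description>Name of the Ranger service that contains policies for this Trino instance</description>
</property>
<property>
<name>ranger.plugin.trino.policy.rest.url</name>
<value>http://your-ranger-host:6080</value>
<description>URL of the Ranger Admin server the plugin pulls policies from</description>
</property>
</configuration>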

### Restart Trino Server

### See audit logs on the Ranger Dashboard: under Audit, you will start seeing audit log entries