
Saturday, January 6, 2024

1 billion rows challenge in Trino

The following article mentions the 1 billion rows challenge and uses Postgres & ClickHouse, so I thought I would run the same test on the Trino query engine.

My Trino cluster is set up on AWS EC2 using only 2 nodes (1 master/coordinator node and 1 worker node; each node is an r6i.2xlarge with 8 vCPUs and 64 GB RAM). The master node also runs the Hive standalone metastore.
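For context, the Hive connector on each node needs to know where the metastore and S3 live. A minimal hive catalog file could look like the sketch below; the metastore host, port, and credential properties are assumptions and the exact S3 property names vary by Trino version, so treat this as a sketch rather than my actual config.

# etc/catalog/hive.properties (sketch, values assumed)
connector.name=hive
hive.metastore.uri=thrift://coordinator-host:9083
# S3 credentials or an IAM role go here; property names depend on the Trino version
hive.s3.aws-access-key=...
hive.s3.aws-secret-key=...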





Based on the article, I was able to generate the data, which is about 12.6 GB in size. I copied the data as-is to an S3 bucket (s3://my_bucket/warehouse/measurement_ext) and exposed it through a Trino/Hive external table.
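Getting the generated file into S3 is a plain upload; a sketch with the AWS CLI, assuming the generator wrote a single measurements.txt file:

aws s3 cp measurements.txt s3://my_bucket/warehouse/measurement_ext/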

hive is the catalog and b2b is the schema in that catalog.

trino-cli --catalog hive
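If the b2b schema does not exist yet, it can be created once connected; the location below is an assumption based on the bucket layout above:

CREATE SCHEMA IF NOT EXISTS hive.b2b WITH (location = 's3://my_bucket/warehouse/');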

 

DROP TABLE hive.b2b.measurements_ext;


-- Create External Table pointing to csv file on S3


CREATE TABLE hive.b2b.measurements_ext (

    station_name VARCHAR,

    measurement VARCHAR

)

WITH (

    format = 'CSV',

    csv_separator = ';' ,

    external_location = 's3://my_bucket/warehouse/measurement_ext/'

);
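A quick smoke test that the external table actually reads the CSV (a sketch, not part of the original run):

SELECT * FROM hive.b2b.measurements_ext LIMIT 5;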


-- Load data from the external table into another Hive/Trino table in Parquet format, bucketed into 30 buckets by station_name.


DROP TABLE hive.b2b.measurements;


CREATE TABLE hive.b2b.measurements (measurement, station_name)

WITH ( format = 'PARQUET' , bucketed_by=ARRAY['station_name'], bucket_count = 30 )

AS

Select  cast ( measurement as double) as measurement , station_name from hive.b2b.measurements_ext ;




trino:b2b> CREATE TABLE hive.b2b.measurements (measurement, station_name)

        -> WITH ( format = 'PARQUET' , bucketed_by=ARRAY['station_name'], bucket_count = 30 )

        -> AS

        -> Select  cast ( measurement as double) as measurement , station_name from hive.b2b.measurements_ext ;

        -> 

CREATE TABLE: 1000000000 rows


Query 20240107_035847_00022_3hbbp, FINISHED, 2 nodes

Splits: 319 total, 319 done (100.00%)

21:25 [1000M rows, 1.61TB] [778K rows/s, 1.28GB/s]


The measurements table took about 21 minutes 25 seconds to copy 1 billion rows from the external (CSV) table into Parquet format.
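As a sanity check on the load (not shown in the run above), the row count and table statistics can be pulled with:

SELECT count(*) FROM hive.b2b.measurements;

SHOW STATS FOR hive.b2b.measurements;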


Let's run the analytics query:

WITH AGG AS(

    SELECT station_name, 

           MIN(measurement)  min_measurement,

           AVG(measurement)  mean_measurement,

           MAX(measurement)  max_measurement

    FROM hive.b2b.measurements

    GROUP BY station_name

    )

SELECT listagg ( station_name || '=' || CONCAT_WS('/',cast ( min_measurement as varchar ), cast (mean_measurement as varchar), cast( max_measurement as varchar) ) ,',' ) WITHIN GROUP ( Order By station_name)

    FROM AGG;




trino:b2b> WITH AGG AS(

        ->     SELECT station_name, 

        ->            MIN(measurement)  min_measurement,

        ->            AVG(measurement)  mean_measurement,

        ->            MAX(measurement)  max_measurement

        ->     FROM hive.b2b.measurements

        ->     GROUP BY station_name

        ->     )

        -> SELECT listagg ( station_name || '=' || CONCAT_WS('/',cast ( min_measurement as varchar ), cast (mean_measurement as varchar), cast( max_measurement as varchar) ) ,',' ) WITHIN GROUP ( Order By station_name)

        ->     FROM AGG;

        -> 

                                                                                                                                                            >

------------------------------------------------------------------------------------------------------------------------------------------------------------>

 Abha=-3.34E1/1.801007455768859E1/6.63E1,Abidjan=-2.44E1/2.5999503615664945E1/7.42E1,Abéché=-2.28E1/2.9396477080815338E1/8.03E1,Accra=-2.66E1/2.639164497914>

(1 row)


Query 20240107_042948_00026_3hbbp, FINISHED, 1 node

Splits: 77 total, 77 done (100.00%)

12.81 [1000M rows, 1.71GB] [78.1M rows/s, 137MB/s]


trino:b2b>



The query took 12.81 seconds to complete (assuming the data is already available in S3 in Parquet format).
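One cosmetic note: casting the doubles to varchar prints scientific notation (e.g. -3.34E1). To get one-decimal output closer to what the challenge expects, the casts could be swapped for format(); an untested sketch of the same query:

WITH AGG AS (
    SELECT station_name,
           MIN(measurement) min_measurement,
           AVG(measurement) mean_measurement,
           MAX(measurement) max_measurement
    FROM hive.b2b.measurements
    GROUP BY station_name
)
SELECT listagg(station_name || '=' || CONCAT_WS('/', format('%.1f', min_measurement), format('%.1f', mean_measurement), format('%.1f', max_measurement)), ',') WITHIN GROUP (ORDER BY station_name)
FROM AGG;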