1 billion rows challenge in Trino
The following article covers the 1 billion rows challenge using Postgres and ClickHouse, so I decided to run the same test on the Trino query engine.
My Trino cluster is set up on AWS EC2 with only 2 nodes (1 master/coordinator node and 1 worker node; each node is an r6i.2xlarge with 8 vCPUs and 64 GB RAM). The master node also runs the Hive standalone metastore.
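For context, a minimal Hive catalog file for such a setup might look like the sketch below. The hostname, port, and credential properties here are placeholders I am assuming for illustration, not copied from my cluster:

```properties
# etc/catalog/hive.properties (sketch; host and keys are placeholders)
connector.name=hive
hive.metastore.uri=thrift://coordinator-host:9083
hive.s3.aws-access-key=YOUR_ACCESS_KEY
hive.s3.aws-secret-key=YOUR_SECRET_KEY
```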
Based on the article, I generated the data, which comes to about 12.6 GB. I copied it as-is to an S3 bucket (s3://my_bucket/warehouse/measurement_ext) and exposed it through a Trino/Hive external table.
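For reference, the challenge data is plain text with one station;measurement pair per line. A minimal generator sketch (the station list and row count here are placeholders; the real generator from the challenge uses the full station table and 1,000,000,000 rows):

```python
import random

# Tiny stand-in for the 1BRC data generator: same file format,
# but only a placeholder station subset and a handful of rows.
stations = ["Abha", "Abidjan", "Accra"]

def generate(path: str, rows: int) -> None:
    with open(path, "w") as f:
        for _ in range(rows):
            station = random.choice(stations)
            # One decimal place, same as the challenge data
            temp = round(random.uniform(-40.0, 60.0), 1)
            f.write(f"{station};{temp}\n")

generate("measurements_sample.csv", 10)
```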
The catalog is hive and the schema within it is b2b.
trino-cli --catalog hive
DROP TABLE hive.b2b.measurements_ext;
-- Create External Table pointing to csv file on S3
CREATE TABLE hive.b2b.measurements_ext (
station_name VARCHAR,
measurement VARCHAR
)
WITH (
format = 'CSV',
csv_separator = ';' ,
external_location = 's3://my_bucket/warehouse/measurement_ext/'
);
-- Load the data from the external table into another table in Parquet format,
-- bucketed into 30 buckets by station_name.
DROP TABLE hive.b2b.measurements;
CREATE TABLE hive.b2b.measurements (measurement, station_name)
WITH (format = 'PARQUET', bucketed_by = ARRAY['station_name'], bucket_count = 30)
AS
SELECT CAST(measurement AS double) AS measurement, station_name FROM hive.b2b.measurements_ext;
CREATE TABLE: 1000000000 rows
Query 20240107_035847_00022_3hbbp, FINISHED, 2 nodes
Splits: 319 total, 319 done (100.00%)
21:25 [1000M rows, 1.61TB] [778K rows/s, 1.28GB/s]
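Bucketing by station_name means each row is routed to one of 30 files by hashing the bucket column, so all rows for a given station land in the same bucket. A rough illustration of the idea (Hive/Trino use their own deterministic hash function, not the toy one below):

```python
# Illustration only: real Hive bucketing uses its own hash scheme.
# bucket_count matches the table definition above.
bucket_count = 30

def bucket_for(station_name: str) -> int:
    # Stand-in hash: sum of the UTF-8 bytes, modulo the bucket count.
    return sum(station_name.encode()) % bucket_count

stations = ["Abha", "Abidjan", "Accra"]
buckets = {s: bucket_for(s) for s in stations}
```

The key property is determinism: the same station always maps to the same bucket, which lets the engine prune and parallelize work per bucket.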
Let's run the analytics query:
WITH AGG AS(
SELECT station_name,
MIN(measurement) min_measurement,
AVG(measurement) mean_measurement,
MAX(measurement) max_measurement
FROM hive.b2b.measurements
GROUP BY station_name
)
SELECT listagg(station_name || '=' || CONCAT_WS('/', CAST(min_measurement AS varchar), CAST(mean_measurement AS varchar), CAST(max_measurement AS varchar)), ',') WITHIN GROUP (ORDER BY station_name)
FROM AGG;
------------------------------------------------------------------------------------------------------------------------------------------------------------
Abha=-3.34E1/1.801007455768859E1/6.63E1,Abidjan=-2.44E1/2.5999503615664945E1/7.42E1,Abéché=-2.28E1/2.9396477080815338E1/8.03E1,Accra=-2.66E1/2.639164497914... (output truncated by the terminal)
(1 row)
Query 20240107_042948_00026_3hbbp, FINISHED, 1 node
Splits: 77 total, 77 done (100.00%)
12.81 [1000M rows, 1.71GB] [78.1M rows/s, 137MB/s]
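For reference, the same per-station aggregation in plain Python on a tiny hypothetical sample (this is just a sketch of the logic; the station names and values below are made up, and Python's float formatting differs from Trino's varchar cast, which prints scientific notation):

```python
from collections import defaultdict

# Hypothetical sample rows in the same "station;measurement" shape
rows = ["Abha;-3.5", "Abha;20.5", "Accra;26.0", "Accra;-1.0"]

stats = defaultdict(list)
for line in rows:
    station, value = line.split(";")
    stats[station].append(float(value))

# station=min/mean/max, comma-joined, ordered by station name,
# mirroring the listagg ... WITHIN GROUP (ORDER BY station_name) query
parts = []
for station in sorted(stats):
    vals = stats[station]
    parts.append(f"{station}={min(vals)}/{sum(vals) / len(vals)}/{max(vals)}")

result = ",".join(parts)
print(result)  # Abha=-3.5/8.5/20.5,Accra=-1.0/12.5/26.0
```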