
Saturday, January 6, 2024

1 billion rows challenge in Trino

The following article mentions the 1 billion rows challenge and uses Postgres & ClickHouse, so I thought I would run the same test on the Trino query engine.

My Trino cluster is set up on AWS EC2 using only 2 nodes (1 master/coordinator node and 1 worker node; each node is an r6i.2xlarge with 8 vCPUs and 64 GB RAM). The master node also runs the Hive standalone metastore.
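For context, the Hive connector on each node needs to know where the metastore and S3 live. A minimal hive catalog file could look like the sketch below; the metastore host, port, and credential properties are assumptions and the exact S3 property names vary by Trino version, so treat this as a sketch rather than my actual config.

# etc/catalog/hive.properties (sketch, values assumed)
connector.name=hive
hive.metastore.uri=thrift://coordinator-host:9083
# S3 credentials or an IAM role go here; property names depend on the Trino version
hive.s3.aws-access-key=...
hive.s3.aws-secret-key=...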





Based on the article, I was able to generate the data, which is about 12.6 GB in size. I copied the data as-is to an S3 bucket (s3://my_bucket/warehouse/measurement_ext) and exposed it through a Trino/Hive external table.
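Getting the generated file into S3 is a plain upload; a sketch with the AWS CLI, assuming the generator wrote a single measurements.txt file:

aws s3 cp measurements.txt s3://my_bucket/warehouse/measurement_ext/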

hive is the catalog and b2b is the schema in that catalog.

trino-cli --catalog hive
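If the b2b schema does not exist yet, it can be created once connected; the location below is an assumption based on the bucket layout above:

CREATE SCHEMA IF NOT EXISTS hive.b2b WITH (location = 's3://my_bucket/warehouse/');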

 

DROP TABLE hive.b2b.measurements_ext;


-- Create External Table pointing to csv file on S3


CREATE TABLE hive.b2b.measurements_ext (

    station_name VARCHAR,

    measurement VARCHAR

)

WITH (

    format = 'CSV',

    csv_separator = ';' ,

    external_location = 's3://my_bucket/warehouse/measurement_ext/'

);
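A quick smoke test that the external table actually reads the CSV (a sketch, not part of the original run):

SELECT * FROM hive.b2b.measurements_ext LIMIT 5;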


-- Load data from the external table into another Hive/Trino table in Parquet format, bucketed into 30 buckets by station_name.


DROP TABLE hive.b2b.measurements;


CREATE TABLE hive.b2b.measurements (measurement, station_name)

WITH ( format = 'PARQUET' , bucketed_by=ARRAY['station_name'], bucket_count = 30 )

AS

Select  cast ( measurement as double) as measurement , station_name from hive.b2b.measurements_ext ;




trino:b2b> CREATE TABLE hive.b2b.measurements (measurement, station_name)

        -> WITH ( format = 'PARQUET' , bucketed_by=ARRAY['station_name'], bucket_count = 30 )

        -> AS

        -> Select  cast ( measurement as double) as measurement , station_name from hive.b2b.measurements_ext ;

        -> 

CREATE TABLE: 1000000000 rows


Query 20240107_035847_00022_3hbbp, FINISHED, 2 nodes

Splits: 319 total, 319 done (100.00%)

21:25 [1000M rows, 1.61TB] [778K rows/s, 1.28GB/s]


The measurements table took about 21 minutes 25 seconds to copy 1 billion rows from the external (CSV) table into Parquet format.
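As a sanity check on the load (not shown in the run above), the row count and table statistics can be pulled with:

SELECT count(*) FROM hive.b2b.measurements;

SHOW STATS FOR hive.b2b.measurements;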


Let's run the analytics query:

WITH AGG AS(

    SELECT station_name, 

           MIN(measurement)  min_measurement,

           AVG(measurement)  mean_measurement,

           MAX(measurement)  max_measurement

    FROM hive.b2b.measurements

    GROUP BY station_name

    )

SELECT listagg ( station_name || '=' || CONCAT_WS('/',cast ( min_measurement as varchar ), cast (mean_measurement as varchar), cast( max_measurement as varchar) ) ,',' ) WITHIN GROUP ( Order By station_name)

    FROM AGG;




trino:b2b> WITH AGG AS(

        ->     SELECT station_name, 

        ->            MIN(measurement)  min_measurement,

        ->            AVG(measurement)  mean_measurement,

        ->            MAX(measurement)  max_measurement

        ->     FROM hive.b2b.measurements

        ->     GROUP BY station_name

        ->     )

        -> SELECT listagg ( station_name || '=' || CONCAT_WS('/',cast ( min_measurement as varchar ), cast (mean_measurement as varchar), cast( max_measurement as varchar) ) ,',' ) WITHIN GROUP ( Order By station_name)

        ->     FROM AGG;

        -> 

                                                                                                                                                            >

------------------------------------------------------------------------------------------------------------------------------------------------------------>

 Abha=-3.34E1/1.801007455768859E1/6.63E1,Abidjan=-2.44E1/2.5999503615664945E1/7.42E1,Abéché=-2.28E1/2.9396477080815338E1/8.03E1,Accra=-2.66E1/2.639164497914>

(1 row)


Query 20240107_042948_00026_3hbbp, FINISHED, 1 node

Splits: 77 total, 77 done (100.00%)

12.81 [1000M rows, 1.71GB] [78.1M rows/s, 137MB/s]


trino:b2b>



The query took 12.81 seconds to complete (assuming the data is already available in S3 in Parquet format).
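One cosmetic note: casting the doubles to varchar prints scientific notation (e.g. -3.34E1). To get one-decimal output closer to what the challenge expects, the casts could be swapped for format(); an untested sketch of the same query:

WITH AGG AS (
    SELECT station_name,
           MIN(measurement) min_measurement,
           AVG(measurement) mean_measurement,
           MAX(measurement) max_measurement
    FROM hive.b2b.measurements
    GROUP BY station_name
)
SELECT listagg(station_name || '=' || CONCAT_WS('/', format('%.1f', min_measurement), format('%.1f', mean_measurement), format('%.1f', max_measurement)), ',') WITHIN GROUP (ORDER BY station_name)
FROM AGG;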