Prior to version 9.3, yes, there is a performance penalty in Vertica for specifying a "large" VARCHAR length. Primarily because, though the disk storage used is only the length of the data, memory allocation happens for the full length of the column, even if the data is much smaller. There are two ways the performance penalty manifests.

If a VARCHAR(500) (or any large value) is used in a JOIN condition with a HASH join, the HASH table will be unnecessarily large, resulting in OOM or DISK_SPILLING. I have seen this happening in production systems. Vertica 9.3 introduced the JFMT hint and the JoinDefaultTupleFormat configuration parameter (settable at session and database level), which mitigate this problem effectively. Note, though, that the default value of the parameter is still "fixed" (not "variable").

The other way the penalty manifests is that, apart from allocating a larger memory footprint, if Vertica has to transmit the result to the client, it will use full-length buffers.
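For anyone who wants to try that mitigation, a minimal sketch follows. The statement forms and the hint argument are quoted from memory and the table/column names are invented, so treat it as illustrative and check the documentation for your release:

-- Change the join tuple format for the current session only
ALTER SESSION SET JoinDefaultTupleFormat = 'variable';

-- Or change the database-wide default for new sessions
-- (on older releases the syntax may differ, e.g. SET_CONFIG_PARAMETER)
ALTER DATABASE DEFAULT SET JoinDefaultTupleFormat = 'variable';

-- Or pick the format for a single statement with the JFMT hint
-- (f = fixed, v = variable); fact_orders/dim_customer are made-up names
SELECT /*+JFMT(v)*/ f.order_id, d.customer_name
FROM fact_orders f
JOIN dim_customer d ON d.customer_key = f.customer_key;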
IMHO, also any time the column is used in segmentation, ORDER BY, GROUP BY, DISTINCT, etc., it will be more efficient to specify a reasonable/expected length for the VARCHAR column. I have seen LONG VARCHAR columns declared at 2.5 million bytes where the text data is actually in the range of 10-100K. This can be mitigated by casting/substringing when selecting data from such columns.
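A sketch of that casting/substring mitigation — the table and column names here are made up; the point is simply to project a narrower, explicitly typed slice of the oversized column instead of its full declared width:

-- comment_text is declared LONG VARCHAR(2500000), but the real data is a few KB;
-- downstream consumers only see a right-sized projection of it
SELECT comment_id,
       SUBSTR(comment_text, 1, 4000)::VARCHAR(4000) AS comment_text_short
FROM comments_raw
WHERE comment_date >= CURRENT_DATE - 7;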
A quick example shows what is being referred to: the same single-row query is run against a VARCHAR(1) column and a VARCHAR(65000) column, and peak_memory_kb is then compared in query_consumption.

dbadmin=> CREATE TABLE t_varchar_1 (c VARCHAR(1));
dbadmin=> CREATE TABLE t_varchar_65000 (c VARCHAR(65000));
dbadmin=> INSERT INTO t_varchar_1 SELECT 'A';
dbadmin=> INSERT INTO t_varchar_65000 SELECT 'A';
dbadmin=> SELECT current_trans_id(), current_statement(), c FROM t_varchar_1 WHERE c = 'A';
dbadmin=> SELECT peak_memory_kb FROM query_consumption WHERE transaction_id = 45035996273991281 AND statement_id = 3;
dbadmin=> SELECT current_trans_id(), current_statement(), c FROM t_varchar_65000 WHERE c = 'A';
dbadmin=> SELECT peak_memory_kb FROM query_consumption WHERE transaction_id = 45035996273991281 AND statement_id = 5;

As a matter of fact, yes, there is a downside in doing so. When I first started working with Vertica as a consultant back in 2011, a client had defined ALL of their character fields as VARCHAR(65000)! Needless to say, they were complaining about query performance. We "right-sized" their VARCHARs, and voila, performance went through the roof!

Setting JoinDefaultTupleFormat to "variable" will not resolve all issues with oversized column types: loading and sorting will remain penalised at database level and, what hurts much more, all SQL clients - even vsql - will over-allocate fetch buffers for the maximum possible lengths. And those maximum possible lengths will be multiplied by the rowset - the number of rows you fetch with each fetch call. You'll run out of memory quickly on the middle tier or client side - or both. And even if you use your judgment and size your columns wisely: with JoinDefaultTupleFormat set to "variable" you are slower than with fixed formats, where simple pointer arithmetic lets you navigate between tuples, whereas with variable buffer lengths you always have to calculate at the atomic level to navigate memory. Fix the problems as high up in the food chain as you can; this applies to BI tools, ETL tools, and your favourite SQL client, to name a few. My suggestion is to run a SELECT MAX(OCTET_LENGTH(RTRIM(s))) for all suspicious strings, and use the result for re-dimensioning your VARCHARs.

I agree with your overall assessment and guidance. While I understand the impact of using "wide" columns, in the industry and among Vertica customers the following points come into play:
1. Vertica customers who build a data lake in Vertica tend to lean towards ELT rather than ETL. At some customer sites, data from several different independent data sources (and databases) is continuously ingested into Vertica with minimal transformation, and all the transformations (technical and business) then happen within Vertica. In such situations, customers want to maintain "high fidelity" of data types and model to the respective source systems, over which we have no control.
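Circling back to the re-dimensioning suggestion a couple of replies up, a rough sketch of how that might look in practice. The catalog query is written from memory, and the threshold, schema, table and column names are all placeholders:

-- Find columns whose declared width looks suspicious
-- (1000 is an arbitrary cut-off; adjust to taste)
SELECT table_schema, table_name, column_name, data_type
FROM v_catalog.columns
WHERE data_type ILIKE 'varchar%'
  AND data_type_length > 1000;

-- For each candidate, measure the real maximum byte length of the data
SELECT MAX(OCTET_LENGTH(RTRIM(customer_note))) AS real_max_len
FROM staging.customers;

The measured value, plus whatever headroom you judge sensible, is then what you would use when redefining the column.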