19 Commits

Author SHA1 Message Date
yan zhang ba48ea9526
[BugFix] Fix Child Class Loader of JNI Readers (#60163)
Signed-off-by: yan zhang <dirtysalt1987@gmail.com>
2025-06-24 18:44:17 +08:00
RyanZ 2787b44ddc
[Enhancement] support incompatible avro schema (#57296)
Signed-off-by: yanz <dirtysalt1987@gmail.com>
2025-03-26 14:22:52 +08:00
Vikas Attiguppa 50b1a91924
[Enhancement] Fixing CVEs (#54749)
Signed-off-by: Vikas Attiguppa <20652333+va-os-commits@users.noreply.github.com>
2025-01-08 16:33:29 -08:00
Smith Cruise b47bdebcfc
[Enhancement] Fix cve problems in java-extensions module (#49425)
Signed-off-by: Smith Cruise <chendingchao1@126.com>
2024-08-08 17:05:40 +08:00
Smith Cruise 6b01a165c4
[Enhancement] Fix some hudi cve problems (#49157)
Signed-off-by: Smith Cruise <chendingchao1@126.com>
2024-08-02 10:32:08 +08:00
stephen 0066149594
[Feature] support to query iceberg refs table (#48972)
Signed-off-by: stephen <stephen5217@163.com>
2024-07-29 15:26:41 +08:00
裸奔丶小馒头 c349de8ac9
[BugFix] Fix JniScanner crash due to struct null indicator (#46492)
Signed-off-by: changxin <streakxin@foxmail.com>
2024-06-07 11:46:25 +00:00
裸奔丶小馒头 b8cbc29f09
[BugFix] Fix the crash caused by JniScanner (#44903)
Signed-off-by: changxin <streakxin@foxmail.com>
2024-05-24 10:33:42 +08:00
Smith Cruise 0fcf3d7eba
[Enhancement] Bump FE/BE's hadoop to 3.4.0 (#45312)
Why I'm doing:
To address the CVE problem, we need to upgrade the Hadoop SDK from 3.3.6 to 3.4.0.
The upgrade brings in AWS Java SDK v2, so we can remove SDK v1.

Signed-off-by: Smith Cruise <chendingchao1@126.com>
2024-05-16 14:40:26 +08:00
RyanZ 1569f589dc
[BugFix] fix empty required fields in jni scanner (#45568)
Signed-off-by: yanz <dirtysalt1987@gmail.com>
2024-05-14 13:59:48 +08:00
RyanZ b6ca919bf7
[Feature] Optimize `count(1)` in hdfs scanner by rewriting plan to `sum` (#43616)
Why I'm doing:
Right now, the hdfs scanner optimizes count(1) by emitting a const column with the expected row count.

In extreme cases (large datasets), the number of chunks flowing through the pipeline becomes very large, and operator time and overhead time are not negligible.

Here is a profile of `select count(*) from hive.hive_ssb100g_parquet.lineorder`. To reproduce this extreme case, I changed the code to scale morsels by 20x and repeat row groups by 10x.

With concurrency=1, total time is 51s:

         - OverheadTime: 25s37ms
           - __MAX_OF_OverheadTime: 25s111ms
           - __MIN_OF_OverheadTime: 24s962ms

             - PullTotalTime: 12s376ms
               - __MAX_OF_PullTotalTime: 13s147ms
               - __MIN_OF_PullTotalTime: 11s885ms
What I'm doing:
Rewrite the count(1) query into a sum-style aggregation, so that each row group reader emits only one chunk (size = 1).

With this change, total time drops to 9s.

The original plan looks like:

+----------------------------------+
| Explain String                   |
+----------------------------------+
| PLAN FRAGMENT 0                  |
|  OUTPUT EXPRS:18: count          |
|   PARTITION: UNPARTITIONED       |
|                                  |
|   RESULT SINK                    |
|                                  |
|   4:AGGREGATE (merge finalize)   |
|   |  output: count(18: count)    |
|   |  group by:                   |
|   |                              |
|   3:EXCHANGE                     |
|                                  |
| PLAN FRAGMENT 1                  |
|  OUTPUT EXPRS:                   |
|   PARTITION: RANDOM              |
|                                  |
|   STREAM DATA SINK               |
|     EXCHANGE ID: 03              |
|     UNPARTITIONED                |
|                                  |
|   2:AGGREGATE (update serialize) |
|   |  output: count(*)            |
|   |  group by:                   |
|   |                              |
|   1:Project                      |
|   |  <slot 20> : 1               |
|   |                              |
|   0:HdfsScanNode                 |
|      TABLE: lineorder            |
|      partitions=1/1              |
|      cardinality=600037902       |
|      avgRowSize=5.0              |
+----------------------------------+
The rewritten plan looks like:

+-----------------------------------+
| Explain String                    |
+-----------------------------------+
| PLAN FRAGMENT 0                   |
|  OUTPUT EXPRS:18: count           |
|   PARTITION: UNPARTITIONED        |
|                                   |
|   RESULT SINK                     |
|                                   |
|   3:AGGREGATE (merge finalize)    |
|   |  output: sum(18: count)       |
|   |  group by:                    |
|   |                               |
|   2:EXCHANGE                      |
|                                   |
| PLAN FRAGMENT 1                   |
|  OUTPUT EXPRS:                    |
|   PARTITION: RANDOM               |
|                                   |
|   STREAM DATA SINK                |
|     EXCHANGE ID: 02               |
|     UNPARTITIONED                 |
|                                   |
|   1:AGGREGATE (update serialize)  |
|   |  output: sum(19: ___count___) |
|   |  group by:                    |
|   |                               |
|   0:HdfsScanNode                  |
|      TABLE: lineorder             |
|      partitions=1/1               |
|      cardinality=1                |
|      avgRowSize=1.0               |
+-----------------------------------+
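A rough sketch of the idea behind the rewrite, not the actual scanner code. It assumes a hypothetical `RowGroup` type that exposes the row count from file metadata and an illustrative `CHUNK_SIZE` of 4096; neither name comes from the PR. Both strategies return the same count, but the rewritten one emits one value per row group instead of one chunk per batch of rows.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

CHUNK_SIZE = 4096  # assumed chunk capacity, for illustration only


@dataclass
class RowGroup:
    """Hypothetical stand-in for a row group; num_rows comes from metadata."""
    num_rows: int


def scan_const_chunks(groups: List[RowGroup]) -> Iterator[int]:
    """Original strategy: emit const-column chunks that cover every row.

    Each yielded int is the number of rows in one emitted chunk.
    """
    for group in groups:
        remaining = group.num_rows
        while remaining > 0:
            rows = min(CHUNK_SIZE, remaining)
            yield rows
            remaining -= rows


def scan_count_chunks(groups: List[RowGroup]) -> Iterator[int]:
    """Rewritten strategy: emit one single-row chunk per row group.

    Each yielded int is the value of the ___count___ column in that chunk.
    """
    for group in groups:
        yield group.num_rows


def count_original(groups: List[RowGroup]) -> Tuple[int, int]:
    """count(*) over const chunks: returns (row count, chunks emitted)."""
    chunks = list(scan_const_chunks(groups))
    return sum(chunks), len(chunks)


def count_rewritten(groups: List[RowGroup]) -> Tuple[int, int]:
    """sum(___count___) over per-group counts: returns (row count, chunks emitted)."""
    chunks = list(scan_count_chunks(groups))
    return sum(chunks), len(chunks)
```

For example, with row groups of 10000 and 5000 rows, `count_original` pushes 5 chunks through the pipeline while `count_rewritten` pushes 2, and both report 15000 rows.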
Fixes #45242

Signed-off-by: yanz <dirtysalt1987@gmail.com>
2024-05-10 09:26:45 +08:00
Yi d7399a1b61
[BugFix] fix jni-reader for struct type when writing nulls to off heap (#42285)
Signed-off-by: Yi Wang <connorwang@live.com>
2024-05-08 22:08:46 +08:00
leoyy0316 cdf22824ef
[Enhancement] Enhance JNI reader for date and timestamp type (#38537)
Signed-off-by: leoyy0316 <571684903@qq.com>
2024-01-09 21:43:08 +08:00
Yi e2db8b8ebd
[BugFix] fix Timestamp cast exception when Hive-Serde with version 2.x.x is used (#37185)
Signed-off-by: Yi Wang <connorwang@live.com>
2023-12-27 22:16:06 +08:00
Yi e658d5b802
[Enhancement] Support Hive Table Formats that rely on SerDe properties to be correctly deserialized (#37182)
Signed-off-by: Yi Wang <connorwang@live.com>
2023-12-27 11:32:50 +08:00
Zhang Yifan 17292309f2
[BugFix] fix ArrayIndexOutOfBoundsException in HiveScanner (#36177)
Signed-off-by: zhangyifan27 <chinazhangyifan@163.com>
2023-12-05 13:02:54 +08:00
before-Sunrise a68e29b942
[BugFix] hive jni scanner with rcbinary deal with timezone (#36371)
Signed-off-by: before-Sunrise <unclejyj@gmail.com>
2023-12-04 20:55:17 +08:00
Felix Li 5cc1c6ef26
[Enhancement] Upgrade guava to 32.0.1-jre due to CVE-2023-2976 (#34379)
Signed-off-by: Astralidea <astralidea@163.com>
2023-11-06 14:05:13 +08:00
before-Sunrise 1d7b0efc6d
[Feature] support avro/sequence file/rcfile for hive table and external file table in jni scanner (#34028)
Signed-off-by: before-Sunrise <unclejyj@gmail.com>
2023-11-03 19:52:02 +08:00