Compare commits

...

281 Commits

Author SHA1 Message Date
Gavin 1f6d8515f1
[Cherry-pick][Feature] Add a case sensitive flag to hdfs scan node to indicate whether (#9744) (#11275) 2022-09-16 15:52:01 +08:00
Smith Cruise 8fc334de2c
fix client (#10932)
[Enhancement] enable FE to list all files in HDFS recursively
2022-09-07 11:36:33 +08:00
Binglin Chang a9bdb093cd [Bugfix] clear rowsets before load (#9193)
SchemaChange calls _load_from_pb to reload metadata, but _load_from_pb forgets to reset the _rowsets vector, so some old rowsets may still be stored in _rowsets, causing a "no delvec found" error, because the delvec corresponding to the old rowset has already been removed. This PR clears the _rowsets vector before loading new rowsets from meta.
2022-07-26 19:55:45 +08:00
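A minimal sketch of the fix described above, using stand-in types (`Rowset`, the metadata vector) rather than the real StarRocks classes:

```
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct Rowset {};  // stand-in for the real rowset type
using RowsetSharedPtr = std::shared_ptr<Rowset>;

class TabletUpdates {
public:
    void load_from_pb(const std::vector<RowsetSharedPtr>& meta_rowsets) {
        // The bug: _rowsets kept stale entries from before the schema change,
        // whose delete vectors had already been removed ("no delvec found").
        // The fix: reset the container before loading rowsets from meta.
        _rowsets.clear();
        for (uint32_t i = 0; i < meta_rowsets.size(); ++i) {
            _rowsets[i] = meta_rowsets[i];
        }
    }

private:
    std::map<uint32_t, RowsetSharedPtr> _rowsets;
};
```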
hellolilyliuyi fd0b30309f
Branch 2.3 (#9222)
* Update release-2.3.md

* [docs] pipeline-related-changes-in-2.3

* Update release-2.3.md

* Update release-2.3.md

* Update release-2.3.md
2022-07-26 19:49:15 +08:00
Youngwb d6f301b48d [Enhancement] improve CompoundPredicateOperator hashcode performance (#9186)
(cherry picked from commit 8e60a00034)
2022-07-26 15:13:00 +08:00
Sihui 9720ef25a5
Update External_table.md (#9197) 2022-07-26 13:41:14 +08:00
mergify[bot] 49a6d7e7c3
Fix the deadlock of resource group (#9147) (#9174) 2022-07-25 23:04:11 +08:00
rickif 93d614a886
[BugFix] Disable compression in sending chunk by default (#9161) 2022-07-25 21:34:33 +08:00
hellolilyliuyi abed992569
[Docs]modify navigation 2.3 (#9184)
* Update release-2.3.md

* [docs]modify navigation_2.3

* Update StarRocks_intro.md

* [docs]modify navigation_2.3
2022-07-25 21:32:06 +08:00
hellolilyliuyi 3c57b790f5
[Docs]modify navigation 2.3 (#9180)
* Update release-2.3.md

* [docs]modify navigation_2.3
2022-07-25 21:20:22 +08:00
zihe.liu b0f1c548d5
[Bugfix] Prefer pipeline parallel for non-local one-phase (#9162) 2022-07-25 20:42:50 +08:00
trueeyu 6de3061591
[Others] Add config for librdkafka debug (#8783) (#9167) 2022-07-25 20:23:41 +08:00
zhangqiang a98a99a333 [BugFix] Primary key length is inconsistent with be (#9148)
PersistentIndex is available in version 2.3, but there are many limitations on key columns. For example, varchar (char) key columns are not supported, and the total length of the key columns cannot exceed 64 bytes. However, the length calculation logic in FE is currently inconsistent with BE (for DATE/DATETIME), so we may reject some create requests that we could actually serve.
2022-07-25 18:58:47 +08:00
Pslydhh ba2b3fcd13
[Enhancement] support BIGINT/INT argument for window_funnel (#9032) (#9124) 2022-07-24 13:08:32 +08:00
Youngwb 03bc800be7
[Cherry-pick][Branch-2.3] Fix CompoundPredicateOperator not equals when same children with different order (#8810) (#9103) 2022-07-23 19:36:41 +08:00
zhangqiang 62460094a4 [Enhancement] Reject create table request if checkPersistentIndex failed (#9021)
PersistentIndex is available in branch-2.3, but there are many limitations on key columns. For example, varchar (char) key columns are not supported, and the total length of the key columns cannot exceed 64 bytes.

In the previous implementation, if we created a primary key table with PersistentIndex enabled but these restrictions were not met, we silently used an in-memory PrimaryIndex instead of PersistentIndex, and the table was still created successfully. This can be confusing to the user.

So we now reject the create table request directly if we find that the restrictions are not met.
2022-07-23 17:23:46 +08:00
Murphy 26f0d36d75
[WIP] [Enhance] make re2 driver-local to reduce contention (backport #8904) (#9042) 2022-07-23 16:11:20 +08:00
HangyuanLiu 3f974966a7
Fix password parse string bug (#9038) (#9072) 2022-07-23 14:18:38 +08:00
Sihui 6054da00b4
changes on sql statements (#9094) 2022-07-23 14:12:55 +08:00
stdpain 00af088650 [Bugfix] fix "unsupport decode_dict_codes" error in late_materized (#9046)
After PR #8869, GlobalDictCodeColumnIterator no longer supports `decode_dict_codes(const int32_t* codes, size_t size, vectorized::Column* words)`;
we need to call `decode_dict_codes(const vectorized::Column& codes, vectorized::Column* words)` instead.
2022-07-23 14:08:57 +08:00
stdpain 25d72b3de7 [Bugfix] Fix wrong result when process 'is null' in condition expr in dictionary optimization (#8869) 2022-07-23 14:08:57 +08:00
liuyehcf 4a5216d1ce
[Enhancement] Add invisible session variable 'profile_timeout' (#8999) (#9082) 2022-07-23 10:54:06 +08:00
amber-create cf285345c8
Branch 2.3 (#9056)
* Update Spark_connector.md

* Update Spark_connector.md
2022-07-22 16:29:17 +08:00
amber-create 59dc146a6a
Update Spark_connector.md (#9013) 2022-07-22 16:24:27 +08:00
mergify[bot] 3ac4a3434b
Update sonar4fe.yml (#9027) (#9033)
(cherry picked from commit ae8ac463e5)

Co-authored-by: Stephen <dulong41@gmail.com>
2022-07-22 10:57:10 +08:00
Stephen 345770cefc
Update pom.xml (#9023) 2022-07-22 10:11:05 +08:00
mergify[bot] bb6dca2b57
display data_type in information_schema.columns for (#8895) (#8900) 2022-07-22 07:30:27 +08:00
mergify[bot] 5a2951dbea
[Bug] fix call JNI function in bthread in UDAF when load class failed (#8970) (#9030) 2022-07-22 07:22:17 +08:00
Youngwb 1d98dcc4fd
[BugFix] Disable one stage aggregate with one distinct function (backport #8918) (#8986) 2022-07-21 22:03:51 +08:00
mergify[bot] 8740f3cbc0
[Bugfix] Cancel query when throwing any exception (backport #9005) (#9018) 2022-07-21 21:58:38 +08:00
mergify[bot] 50842a4161
[BugFix] BrokerLoad can't handle kerberos login with multiple keytabs (backport #8820) (#8836) 2022-07-21 21:58:06 +08:00
mergify[bot] 135188a591
[BugFix] Fix LocalTabletsChannel head-use-after-free (#8978) (#9001)
Fix #8906

(cherry picked from commit a43283ae75)

Co-authored-by: Alex Zhu <zhuming9011@gmail.com>
2022-07-21 19:07:53 +08:00
Stephen 837f38d6e8
add sonar cloud check for fe (#8992)
Co-authored-by: dulong <dulong@starrocks.com>
2022-07-21 16:31:58 +08:00
Stephen a07a97167e
add pr documentation label (#8988)
Co-authored-by: dulong <dulong@starrocks.com>
2022-07-21 15:35:04 +08:00
padmejin 85041f8d19 [BugFix] Persist `LoadStatistic` in `EtlStatus` (#8689)
The main reason is that LoadJob did not persist LoadStatistic during either replay or snapshot. Unfortunately, there is no obvious way to add new metadata, since we persist the LoadJob class in a hard-coded way. After due consideration, we serialize LoadStatistic in JSON, for the convenience of adding/deleting fields in the future, and hide this JSON in a deprecated map in EtlStatus.

(cherry picked from commit f931637a65)
2022-07-21 14:12:58 +08:00
rickif aae9599c6c [Enhancement] Add loading error to error_url (#7882) 2022-07-21 11:35:32 +08:00
mergify[bot] 56c72498e9
[BugFix] Fix OlapTableSink close accelerate release resource (#8893) (#8911)
(cherry picked from commit b310b4858a)

Co-authored-by: meegoo <hujie-dlut@qq.com>
2022-07-19 19:21:07 +08:00
hellolilyliuyi 110cb4da15
Update release-2.3.md (#8875) 2022-07-18 22:16:22 +08:00
hellolilyliuyi a4aa2f0a6c
docs add navigation (#8868) 2022-07-18 21:57:25 +08:00
sevev f716a49b73 Fix be crash in ASAN mode 2022-07-18 21:37:06 +08:00
hellolilyliuyi edd1e4c5a4
[docs] Update release-2.3.md (#8834)
* Update release-2.0.md

* Update release-2.3.md
2022-07-18 13:33:34 +08:00
xyz 39a4db537d [BugFix] memory-scratch-sink output order error (#8578)
In the current implementation, the output order of the memory scratch sink is based on the tuple_descriptors.
This is not correct; the right order should be based on the output_expr.
Conflicts:
	be/src/util/arrow/starrocks_column_to_arrow.cpp
2022-07-18 12:46:31 +08:00
Seaven d7fb0bc9b0
[BugFix] Fix complex exists/in subquery bug (#8687) (#8765) 2022-07-18 10:54:11 +08:00
hellolilyliuyi 43e09576b2
Update release-2.0.md (#8828) 2022-07-18 10:24:44 +08:00
mergify[bot] 04bc550db0
[Bugfix] fix insert overflow decimal value (backport #7280) (#8625) 2022-07-17 11:13:47 +08:00
mergify[bot] fffff1de1a
[BugFix] support reading viewfs:// (#8582) (#8789) 2022-07-16 09:38:50 +08:00
Li Jiao 61151b357f
fix the relative links under docs/ (#8739) (#8796) 2022-07-15 23:37:48 +08:00
絵空事スピリット a792b15c11
[Doc] Mask host name (#8784) 2022-07-15 19:53:56 +08:00
絵空事スピリット d65590ebce
Update Deployment.md (#8778) 2022-07-15 19:36:03 +08:00
絵空事スピリット bafc32801a
[Doc] Add Deployment (#8770) 2022-07-15 19:16:22 +08:00
mergify[bot] 8f79a8566a
[BugFix] fix unstable ut (#8759) (#8767)
com.starrocks.clone.TabletSchedulerTest#testSubmitBatchTaskIfNotExpired

(cherry picked from commit 9536d49537)

Co-authored-by: padmejin <89557328+padmejin@users.noreply.github.com>
2022-07-15 17:54:05 +08:00
Youngwb d6eaeb05b9
[Cherry-pick][branch-2.3] Fix unknown error when having clause has subquery (#8662) (#8763) 2022-07-15 17:30:41 +08:00
mergify[bot] af951c9762
[BugFix] validate target length when alter json column (#8725) (#8746)
(cherry picked from commit 308acf8c5e)

Co-authored-by: Murphy <96611012+mofeiatwork@users.noreply.github.com>
2022-07-15 16:03:03 +08:00
mergify[bot] b74d44a265
[BugFix] fix the case of dead workgroup (#8729) (#8736)
(cherry picked from commit b8dcc73588)

Co-authored-by: Murphy <96611012+mofeiatwork@users.noreply.github.com>
2022-07-15 16:02:43 +08:00
zihe.liu b370984f7c
[Enhancement] Add MorselsCount and TabletCount to profile of scan operator #8644 (#8681) 2022-07-15 13:57:46 +08:00
meegoo c7e5157611 [CherryPick] Support async olap table sink interface 2022-07-15 10:38:43 +08:00
Murphy c686ec164e
[BugFix] fix NPE of resource group (#8711) 2022-07-15 09:43:27 +08:00
Murphy 43290de111
[Enhance] use max chunk rows to estimate chunk memory (#8706) 2022-07-14 23:04:00 +08:00
mergify[bot] bfdc9434c8
[Enhance] turn on the enable_exchange_pass_through by default (#7906) (#8658)
(cherry picked from commit f65678fca6)

Co-authored-by: Murphy <96611012+mofeiatwork@users.noreply.github.com>
2022-07-14 21:12:29 +08:00
zihe.liu 6ffb863c2c
[Enhancement] Make error messages of workgroup clearer (#8663) (#8677) 2022-07-14 19:01:12 +08:00
絵空事スピリット f22cdeff91
Add Docs to 2.3 (#8669) 2022-07-14 14:13:59 +08:00
mergify[bot] 8cbb042100
[BugFix] Check invalid window argument (backport #8300) (#8641) 2022-07-13 22:37:30 +08:00
mergify[bot] 5ae3aa9768
[BugFix] Fix rate limit of tablet deletion (#8602) (#8654)
If FE finds that the tablet meta has already been deleted when handling a tablet report (i.e. tabletMeta == null), we should also rate-limit the tablet deletion tasks sent to BE.

(cherry picked from commit 5a7288df21)

Co-authored-by: yiming <107105845+nshangyiming@users.noreply.github.com>
2022-07-13 22:34:51 +08:00
Murphy 81e1b6274e
[Feature] display resource group info in explain verbose(backport #8481) (#8631) 2022-07-13 21:04:20 +08:00
satanson 01b6bff02a
fixup avg(distinct non-decimal-type) bugs (#8638)
select avg(distinct non-decimal-type (i.e. BIGINT)) from table gives an invalid multi_distinct_sum signature.
expected: multi_distinct_count[([multi_distinct_count, VARCHAR, false]); args: INT; result: BIGINT
unexpected: multi_distinct_count[([multi_distinct_count, VARCHAR, false]); args: INT; result: DECIMAL128(38,0).
2022-07-13 19:18:28 +08:00
padmejin d39f052173 [BugFix] add read lock to RoutineLoadManager to avoid log out of order (#8295)
1. A user creates a new routine job from a client.
2. The request is handled by a thread, transforming the SQL into a RoutineLoadJob and then putting it into a map of the RoutineLoadManager class in a function guarded by a write lock.
3. The RoutineLoadTaskScheduler thread happens to start a new loop just then, getting all the routine load jobs from the map in the RoutineLoadManager class. Since there's no read lock, it finds the newly added job and schedules it. Then it writes a journal on a state change of this new job.
4. The SQL executor thread is a bit slower in handling the new job. In the end, it writes a journal on creating the new job.
From the follower's view, the journal of creating the job comes after the journal of changing its state. As a result, it fails to replay the former journal, because it cannot change the state of a non-existent job.

The solution is simple: adding a read lock is enough. Since creating a routine job is not a frequent operation, the extra cost to the scheduler thread is negligible.

(cherry picked from commit b3a81f021b)
2022-07-13 15:01:45 +08:00
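The race is language-agnostic; a minimal C++ sketch with hypothetical class and member names illustrates the point — the scheduler must take the read lock so it cannot observe a job before its creation journal is written:

```
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>
#include <vector>

struct RoutineLoadJob { std::string name; };

class RoutineLoadManager {
public:
    // Writer path (SQL executor thread): register the job and write the
    // creation journal under the same write lock.
    void add_job(const std::string& id, RoutineLoadJob job) {
        std::unique_lock<std::shared_mutex> guard(_lock);
        _jobs.emplace(id, std::move(job));
        write_creation_journal(id);  // hypothetical journal call
    }

    // Reader path (scheduler thread): without this shared lock, the
    // scheduler could pick up a job and journal a state change before
    // the creation journal exists, breaking replay on followers.
    std::vector<RoutineLoadJob> get_all_jobs() const {
        std::shared_lock<std::shared_mutex> guard(_lock);
        std::vector<RoutineLoadJob> result;
        for (const auto& entry : _jobs) result.push_back(entry.second);
        return result;
    }

private:
    void write_creation_journal(const std::string&) const { /* ... */ }

    mutable std::shared_mutex _lock;
    std::unordered_map<std::string, RoutineLoadJob> _jobs;
};
```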
zihe.liu 30be784dd3
[Enhancement] add chunk_accumulate_operator (backport #8535) (#8604) 2022-07-12 23:04:50 +08:00
HangyuanLiu 975d48517b Fix embedded quotation contains backslash bug (#8293) 2022-07-12 22:01:11 +08:00
HangyuanLiu 08b0d6cdd1 Make Parser parsing syntax errors compatible with mysql's error format (#8099) 2022-07-12 22:01:11 +08:00
HangyuanLiu 041ef32381 Fix bug QueryStatement return null when getRedirectStatus (#8087) 2022-07-12 22:01:11 +08:00
HangyuanLiu b5cee15aea Delete old parser used in ConnectProcessor (#7979) 2022-07-12 22:01:11 +08:00
HangyuanLiu 66c73bffc3 Add drop statistic table check (#4587) (#8416) 2022-07-12 22:01:11 +08:00
satanson 90572fc41d
rectify wildcard decimal types of multi_distinct_sum converted from sum(distinct) and avg(distinct) (#8425) (#8571)
In the optimize phase, sum(distinct c) is converted into multi_distinct_sum(c), and avg(distinct c) is converted into multi_distinct_sum(c) / multi_distinct_count(c). The resolved function signature for decimal types carries a wildcard decimal, which is illegal in BE, so we must rectify wildcard decimal types by replacing them with real decimal types, as we do in the analyze phase.
2022-07-12 15:34:52 +08:00
padmejin ad7bf9f583 [BugFix] Late recycle if repair tablet from recycle bin (#8254)
Submit repair tasks only if no expired db/table/partition is involved, to avoid an NPE when a tablet is added after its db/table/partition has been dropped.
2022-07-12 13:16:34 +08:00
mergify[bot] 312bbc32fe
[Enhancement] add log configuration for the jars called via JNI in BE (backport #8104) (#8563) 2022-07-12 10:17:21 +08:00
mergify[bot] 0571c2c42d
[Bugfix] Add hive external table counter (#8484) (#8529)
(cherry picked from commit 0c9573a552)

Co-authored-by: stephen <91597003+stephen-shelby@users.noreply.github.com>
2022-07-11 16:14:04 +08:00
mergify[bot] e1ca6cd66b
[Bugfix] fix hive column stats cache bug (#8455) (#8530)
(cherry picked from commit 7393e8b77a)

Co-authored-by: stephen <91597003+stephen-shelby@users.noreply.github.com>
2022-07-11 16:13:53 +08:00
mergify[bot] f1c783108b
[BugFix] Fix storage cooldown time ut (#8486) (#8502) 2022-07-10 08:07:33 +08:00
zihe.liu 1c373141bc
[Bugfix] release request memory before instance_mem_tracker is destructed (backport #8473) (#8496) 2022-07-09 22:59:04 +08:00
zihe.liu e3aa62c460
[Enhancement] use memchr instead of SIMD::count_nonzero to calculate has_null (#8450) (#8474) 2022-07-09 16:59:07 +08:00
Murphy 33aa4b003a
[Enhance] allow simple limit sql exceed bigquery_scan_rows_limit (backport #8380) (#8460) 2022-07-09 15:19:53 +08:00
kangkaisen d0aff53f0f
Remove meaningless DCHECK for create_varchar_type (#8107) (#8462) 2022-07-09 11:04:30 +08:00
zihe.liu 9a462aff9a
[Enhancement] limit buffer capacity instead of scan concurrency (#8427) 2022-07-09 11:03:30 +08:00
mergify[bot] de4c12a914
[Feature] set global_runtime_filter_build_max_size to 0 to force use filter (backport #8445) (#8448) 2022-07-09 10:26:10 +08:00
padmejin dada85ae2f [BugFix] Skip check partition when change kafka offset if not initialized (#8290)
(cherry picked from commit 99e5ff8d80)
2022-07-09 09:48:33 +08:00
mergify[bot] b22be30145
[BugFix] raise exception if image download fails (#8111) (#8440)
Fix these 2 cases:
1. No exception is thrown when loading an empty image file.
2. No exception is thrown when failing to download an image from the helper, for example, if the image dir is readable to
     the FE process but the image file is not.

(cherry picked from commit 8d5e750107)

Co-authored-by: padmejin <89557328+padmejin@users.noreply.github.com>
2022-07-08 22:35:59 +08:00
yan.zhang 9636116452
move `scan_operator->prepare` operation into IO thread (#8423) 2022-07-08 17:38:23 +08:00
zihe.liu 70f7ac1e79
[Bugfix] use correct mem_tracker for exchange/result sink (backport #8402) (#8414) 2022-07-08 14:26:06 +08:00
gengjun-git a7f7c627cb [Enhancement] Add some cleaner config param to bdb #7993
Add bdbje_cleaner_threads and bdbje_replay_cost_percent to speed up cleaning. If the bdb directory keeps expanding, set bdbje_cleaner_threads to a higher value (4, for example) and set bdbje_replay_cost_percent to 0.

(cherry picked from commit 1793478e2a)
2022-07-08 14:09:46 +08:00
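For reference, the settings suggested above would look like this in fe.conf (the parameter names and example values are the ones from the commit message):

```
# Speed up BDB JE log cleaning when the bdb directory keeps growing.
bdbje_cleaner_threads = 4
bdbje_replay_cost_percent = 0
```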
gengjun-git 1f8d9e66f6 [BugFix] Fix bug that thrift server daemon exits quietly (#7974)
Rewrite the execute() function of org.apache.thrift.server.TThreadPoolServer: we do not kill the server for any exception thrown by the ExecutorService; we just close the connection and print an error log. TThreadPoolServer, by contrast, kills the server, after which no new connection is processed and connections saturate the TCP backlog.

Because TThreadPoolServer has some private properties, we create a new implementation of TServer, copy the code of TThreadPoolServer, and modify only the error handling of the execute() function.

(cherry picked from commit 3e49ab438f)
2022-07-08 14:09:08 +08:00
zhangqiang bd22b41b2c [Enhancement] Retry building ImmutableIndexShard if move_bucket failed (#8305)
Currently, move_bucket may fail even when a move solution exists, and building an ImmutableIndex will fail if a shard is very imbalanced. We need to prevent index build failures: when move_bucket fails, we can increase the number of pages and retry until it succeeds.
2022-07-07 19:29:49 +08:00
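A minimal sketch of the retry loop described above; `build_shard` and the page counts are hypothetical stand-ins for the real ImmutableIndexShard construction:

```
#include <cstddef>

struct Status {
    bool ok;
    static Status OK() { return {true}; }
    static Status Error() { return {false}; }
};

// Hypothetical: try to place all buckets of the shard into `npage` pages
// via bucket moves; fails when the shard is too imbalanced to fit.
Status build_shard(std::size_t npage) {
    return npage >= 4 ? Status::OK() : Status::Error();  // dummy placement
}

// Retry with more pages until the placement succeeds: a shard that does
// not fit into npage pages may fit into npage + 1.
Status build_shard_with_retry(std::size_t npage, std::size_t max_npage) {
    for (; npage <= max_npage; ++npage) {
        Status st = build_shard(npage);
        if (st.ok) {
            return st;
        }
    }
    return Status::Error();
}
```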
mergify[bot] bc03bba2cb
[Enhancement] choose src replica with bigger version first when cloning (#8367) (#8382) 2022-07-07 12:40:52 +08:00
zhangqiang 403b87bb98
[Cherry-pick][Branch-2.3][BugFix] Fix some bug of branch-2.3 (#8353)
* BugFix: Fix error update of PersistentIndexMeta (#8288)

* BugFix: Fix potential inconsistency between persistent index file and PersistentIndexMeta (#8286)

The enable_persistent_index flag of the tablet meta can be changed by ALTER, so we can't guarantee that enable_persistent_index will not be changed during the apply process.

If we change enable_persistent_index during apply, it may cause inconsistency between persistent index file and PersistentIndexMeta.
2022-07-07 11:23:22 +08:00
mergify[bot] 09a260b215
change exception type in IntLiteral constructor so that LargeIntLiteral can handle error message correctly (#7281) (#8376) 2022-07-06 23:04:35 +08:00
mergify[bot] d816ea7f73
[BugFix] bug in Array column's `byte_size` (#8308) (#8336)
The arguments passed to _elements->byte_size were (from, from + size); they should be (from, size).

(cherry picked from commit cad8f445a1)

Co-authored-by: xyz <a997647204@gmail.com>
2022-07-06 19:07:13 +08:00
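A small sketch of the argument mix-up, assuming a `byte_size(from, size)` range overload as the message describes (the helper here is a stand-in, not the real Column API):

```
#include <cstddef>
#include <vector>

// Stand-in for Column::byte_size(from, size): bytes used by `size`
// elements starting at offset `from`.
std::size_t byte_size(const std::vector<int>& col, std::size_t from, std::size_t size) {
    (void)col;
    (void)from;
    return size * sizeof(int);
}

std::size_t array_byte_size(const std::vector<int>& elements,
                            std::size_t from, std::size_t size) {
    // Buggy version passed an end offset where a count was expected:
    //   return byte_size(elements, from, from + size);
    return byte_size(elements, from, size);  // fixed: second arg is a count
}
```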
mergify[bot] 8ab2bf2f25
[Enhance] add session variables in profile (backport #8156) (#8322) 2022-07-06 19:04:51 +08:00
mergify[bot] 6616f70e88
[Bugfix] disable lowcardinality optimize in join conjuncts (#8303) (#8349)
(cherry picked from commit 9322dc5423)

Co-authored-by: stdpain <34912776+stdpain@users.noreply.github.com>
2022-07-06 18:47:08 +08:00
liuyehcf a97acaf1ea
[BugFix] Fix the problem of identical analytic expressions (#8268) (#8333) 2022-07-06 13:56:23 +08:00
mergify[bot] d8f726de26
[BugFix] heap-use-after-free in ExternalScanContextMgr (#8280) (#8299) 2022-07-05 23:30:05 +08:00
zhangqiang d0e828a144
[BugFix] Fix be crash because of NPE (backport #8242) (#8263)
When _chunk_pool is empty and global_status is not OK, BE will crash because of a null pointer access.
2022-07-05 22:34:04 +08:00
Youngwb 4acda6f4ea [BugFix] Derive group logical property after prune columns when need optimize CTE (#8289)
(cherry picked from commit 68cad8188c)
2022-07-05 21:06:39 +08:00
HangyuanLiu 6ce0ec81c5
[Cherry-pick branch-2.3] InsertTxnCommitAttachment read and write meta use GSON (#8210) (#8247) 2022-07-05 21:02:54 +08:00
nshangyiming cd7188c23c [cherry-pick][BugFix] FORCE_REDUNDANT should also check NEED_FURTHER_REPAIR (#7844)
FORCE_REDUNDANT should also check NEED_FURTHER_REPAIR before dropping a replica,
or else the newly cloned replica with a stale version could be
dropped (because the loading process keeps updating the tablet).

(cherry picked from 80eba32ed4)
2022-07-05 20:14:40 +08:00
mergify[bot] 7ef2250fb1
[BugFix] Fix nullable column update rows not set _has_null (#8139) (#8285)
(cherry picked from commit 64a9bb154e)

Co-authored-by: meegoo <hujie-dlut@qq.com>
2022-07-05 19:40:04 +08:00
mergify[bot] 7eeb47d56e
[BugFix] fixup negative from days (#7275) (#8281)
(cherry picked from commit fe451f0a4f)

Co-authored-by: satanson <ranpanf@gmail.com>
2022-07-05 17:27:38 +08:00
yiming 5c3bd5cc6d [BugFix] clean ghost tablets on BE when handling tablet report (#7989)
We need to clean these ghost tablets from the current backend, or else it will continue to report them to FE forever and add processing overhead (the tablet report process is protected by a database lock).

(cherry picked from commit 6fa1363b37)
2022-07-05 12:37:05 +08:00
nshangyiming c4da8bf78f [Enhance] Avoid meaningless tablet repair scheduling (#7660)
(cherry picked from commit 4781fda428)
2022-07-05 12:36:29 +08:00
yiming 5ca7fe82d6 [BugFix] fix replica meta falsely deleted right after clone finished (#7905)
The main cause of this problem is that FE handles tablet report tasks and clone
tasks concurrently; with specific timing, this will happen. So we add a new state
`deferReplicaDeleteToNextReport` for `Replica`, defaulting to true. When FE meets a replica
that exists only in FE's meta, not in the tablet report, it checks
`deferReplicaDeleteToNextReport` and defers the meta delete until the next report from the BE.
In the following situation, a normally cloned replica could be falsely deleted:
1. BE X generates a tablet report and sends it to FE.
2. FE creates a clone task (for balance or repair) and a new replica on BE X,
     so the corresponding tablet is not included in the report sent above.
3. BE X finishes the clone, then FE receives the message and sets the state
     of the new replica to NORMAL.
4. FE processes the tablet report from step 1 and finds that BE X didn't report
    the tablet info corresponding to the newly created replica, so it deletes
    the replica from its meta.
5. On the next tablet report of BE X, which will include the tablet info, FE finds
     that the tablet reported by BE X doesn't exist in its meta, so it sends a
     request asking BE X to delete the newly cloned replica physically.

(cherry picked from commit 03070130c8)
2022-07-05 12:36:14 +08:00
Binglin Chang d1ddf1f470
[Enhancement] Remove warn log in Replica.updateReplicaInfo (#7824) (#8218) 2022-07-04 17:01:35 +08:00
rickif e14245a7b3
[BugFix] Chunked json for txn stream load (#8046)
This PR adds support for chunked JSON in transaction stream load.
The total size of a transaction is not limited, but the data size of a single write within the transaction is limited to 4 GB.
2022-07-04 10:59:34 +08:00
stdpain f2a18e19a7
[Bugfix] Fix function signature error in array_to_bitmap (#7404) (#8157)
fix function signature for array_to_bitmap
2022-07-02 13:27:57 +08:00
xyz e61902f454
[BugFix] enable-check-string-length (backport #8095) #8167 2022-07-02 10:35:52 +08:00
Murphy 2e285d6ddd
[Refactor] refactor and fix the scan counters (backport #8088) (#8172) 2022-07-01 20:38:47 +08:00
zihe.liu d425214b9d
[Enhancement] Rename data_floor function to time_slice (#6951 #7033) (#8170)
* Rename data_floor function to time_slice (#6951)

* [Bug Fix] Fix TimestampArithmeticExpr analyze bug (#7033)

Co-authored-by: kangkaisen <kangkaisen@apache.org>
2022-07-01 19:41:05 +08:00
zhangqiang a10deee007
BugFix: remove PersistentIndexMeta in rocksdb (#8161) (#8171)
When we drop a primary key table using persistent index, we don't clear the persistent index meta in rocksdb.
2022-07-01 19:20:31 +08:00
HangyuanLiu dc32d820a0
Add InsertTxnCommitAttachment to support version rollback (#8145) 2022-07-01 19:16:27 +08:00
zhangqiang 657886b5bd [Bugfix] fix unstable be ut (#8082)
The reason is that when the persistent index is enabled for a primary key table, the apply process takes more time. So a rowset may have been committed but not yet applied when the compaction task is submitted, which causes the subsequent check to fail.
2022-07-01 16:49:53 +08:00
mergify[bot] b89ba7c9b1
[BugFix] parse json when load a parquet string (backport #8110) (#8126)
(cherry picked from commit 7e555b8997)

Co-authored-by: Murphy <96611012+mofeiatwork@users.noreply.github.com>
2022-06-30 22:02:22 +08:00
meegoo 95bb0b4a89
[BugFix] fix counter set in close before initialize (#8074) (#8109) 2022-06-30 21:23:01 +08:00
waittting 7cf70c3fde
Remove the helpers from bdb ReplicationGroupAdmin when dropped fe (#6773) (#7999)
2022-06-30 19:53:59 +08:00
stephen a1e060574c
Revert "hive external unsupport binary type (#8025) (#8026)" (#8084)
This reverts commit 0b501b12c3.
2022-06-30 17:08:30 +08:00
mergify[bot] 0166260004
[Enhance] avoid put too much chunk in ChunkSource::chunk_buffer (backport #8051) (#8069) 2022-06-30 12:38:01 +08:00
mergify[bot] 88af5d5b81
[Feature] support scan_rows limitation for external table (backport #8035) (#8044) 2022-06-30 09:51:27 +08:00
mergify[bot] 5969647c9d
[BugFix] Fix return non-existent codes in global dictionary optimization (#8034) (#8055)
After this change, when BE returns a dictionary of size 256, FE will not use this dictionary.

(cherry picked from commit 52bfe631cf)

Co-authored-by: stdpain <34912776+stdpain@users.noreply.github.com>
2022-06-29 23:39:54 +08:00
zhangqiang e9bc0d71c5 BugFix: fix data lost after compaction using persistent index (#8002)
When PersistentIndex is turned on, we use phmap as l0 to save kv pairs.
However, the kv pair storage may be discontiguous because phmap aligns its entries, which causes wrong data to be written during flush, so the subsequent compaction may fail.
We use uint8_t[8] instead of uint64_t to avoid phmap's alignment and to save memory.
2022-06-29 19:11:33 +08:00
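A minimal sketch of the workaround, under the assumption stated above that the hash map may pad or align its stored pairs:

```
#include <cstdint>
#include <cstring>

// Storing the 8-byte value as uint64_t gives it 8-byte alignment, so the
// raw bytes of a (key, value) entry may contain padding, and a flush that
// copies entries byte-by-byte can write garbage where data was expected.
struct AlignedValue {
    uint64_t v;
};

// Storing the same 8 bytes as uint8_t[8] drops the alignment requirement:
// entries are packed, so flushing their raw bytes is safe and smaller.
struct PackedValue {
    uint8_t v[8];
    void set(uint64_t x) { std::memcpy(v, &x, sizeof(x)); }
    uint64_t get() const {
        uint64_t x;
        std::memcpy(&x, v, sizeof(x));
        return x;
    }
};

static_assert(alignof(PackedValue) == 1, "no padding inside map entries");
```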
mergify[bot] 86f40d22af
[BugFix] fix bunch of bugs of resource group (backport #7933) (#8015) 2022-06-29 17:02:51 +08:00
Binglin Chang 99a2060ada
[Bugfix] Ignore EAGAIN error for futex wait (#7779) (#8020) 2022-06-29 16:49:50 +08:00
mergify[bot] 0b501b12c3
hive external unsupport binary type (#8025) (#8026)
(cherry picked from commit bad5ed303e)

Co-authored-by: stephen <91597003+stephen-shelby@users.noreply.github.com>
2022-06-29 16:48:22 +08:00
lichaoyong 3a3cbf3687 [BugFix] Fix the wrong parameter of log in Substitute (#7880) 2022-06-29 14:42:23 +08:00
lichaoyong 4e0e9ae653 [BugFix] Fix the data race of event_bases and event_https (#7798) 2022-06-29 14:42:23 +08:00
lichaoyong fbcaeeac65 [BugFix] Fix the HttpServer not wait the threads finished. (#6784)
Upon graceful exit, the HttpServer should guarantee that all threads
have finished; otherwise memory is leaked. This pull request fixes the bug.
2022-06-29 14:42:23 +08:00
mergify[bot] 4f85314821
[BugFix] fix resource group metrics (backport #6953) (#8006) 2022-06-29 13:55:50 +08:00
zhangqiang 85cab585c2
[Enhancement] Optimize `find_buckets_to_move` to make sure a move solution is found (#7685) (#8004)
Currently, find_buckets_to_move uses a brute-force search over the buckets to move, trying only movements of 1/2/3 buckets, which may fail to find a solution.
2022-06-29 12:48:21 +08:00
Binglin Chang d8f8c55d87
[Bugfix] Should check cast expr validity in update statement (#7885) (#7988) 2022-06-28 21:13:43 +08:00
zhangqiang 38b959246f
[BugFix]: Change `enable_persistent_index` of primary table may cause be crash in ASAN mode (#7762) (#7926)
This is caused by concurrency between apply and modifying enable_persistent_index. The reference count of the primary index may not equal 1, which causes a crash when calling the remove function in ASAN mode.
2022-06-28 14:53:30 +08:00
mergify[bot] fc2f26f443
[Bugfix] fix convert hive char type to sr type (#7912) (#7930)
Hive          StarRocks
string        varchar(65533)
varchar(len)  varchar(len)
char(len)     char(len)
binary        varchar(65533)

(cherry picked from commit 4eacc80194)

Co-authored-by: stephen <91597003+stephen-shelby@users.noreply.github.com>
2022-06-28 14:44:40 +08:00
mergify[bot] 1bd500f014
[BugFix] try to parse it when casting string to json (backport #7835) (#7863) 2022-06-28 13:34:21 +08:00
mergify[bot] f721c4aac6
[BugFix] Routine load job running time (#7909) (#7922)
(cherry picked from commit b0523a3fc6)

Co-authored-by: rickif <rickif@qq.com>
2022-06-28 11:03:13 +08:00
yan.zhang 5805762ea6
Fix invalid ref to counter in HdfsParquetScanner (#7919) 2022-06-28 09:43:45 +08:00
mergify[bot] 44f271dd2b
[BugFix] ReorderJoinRule use error preconditions (#7831) (#7895)
(cherry picked from commit 0c18f3a294)

Co-authored-by: Youngwb <yangwenbo_mailbox@163.com>
2022-06-27 19:13:10 +08:00
mergify[bot] 34eaaa63e1
[Bug]Fixed the inability to recognize non-UTF-8 encoded strings when collecting dictionary information (#7795) (#7875)
(cherry picked from commit 8cf944be88)

Co-authored-by: stdpain <34912776+stdpain@users.noreply.github.com>
2022-06-27 19:06:21 +08:00
mergify[bot] 3ad411c66c
[BugFix] Fix empty result set (#7793) (#7888)
The chunk ck might be empty if the entire file is skipped

(cherry picked from commit 3979ebbe87)

Co-authored-by: dorianzheng <xingzhengde72@gmail.com>
2022-06-27 16:58:39 +08:00
zihe.liu d8817f1476
[Bugfix] Prevent JVM from handling signals (#6929) (#7841)
When the signal SIGINT or SIGTERM arrives, the BE process captures it and performs graceful exit operations in order. For example, when a SIGTERM arrives:

The process captures SIGTERM so that start_be() can return.
The thread TaskWorkerPool, which uses StarRocksMetrics::instance, is destructed when start_be() returns.
__run_exit_handlers destructs all the static variables, including StarRocksMetrics::instance.
However, the JVM overwrites the handlers of SIGINT and SIGTERM. The sequential graceful exit is then broken: start_be() has no chance to return, so StarRocksMetrics::instance is destructed while the TaskWorkerPool thread is still running.
2022-06-27 11:16:56 +08:00
satanson 89905cba06
fixup bug introduced by PR-7751 (#7832) (#7847) 2022-06-25 20:55:30 +08:00
Youngwb f4831d60e8
[Cherry-pick][branch-2.3][Enhancement] Support multi count distinct with different multi columns (#7850) 2022-06-25 19:48:12 +08:00
mergify[bot] 45e99569d2
[BugFix] Improve performance of selection intersection in the runtime-filter exploration phase. (#7751) (#7816)
(cherry picked from commit 3df468cc2b)

Co-authored-by: satanson <ranpanf@gmail.com>
2022-06-25 17:30:56 +08:00
mergify[bot] b883978af5
add SHUFFLE_HASH_BUCKET to isLocalApplicable (#7750) (#7815)
This issue is analyzed in https://starrocks.feishu.cn/docx/doxcnrOAE5F8G4RzBJzFkBhvt9b

```
+====================+========+===================+================+
| query              | 2.2.1  | 2.2.2(regression) | this PR(2.2.2) |
+====================+========+===================+================+
| hive_tpcds.query95 | 1.260s | 4.330s            | 1.283s         |
+--------------------+--------+-------------------+----------------+
```

(cherry picked from commit 65a1ee9906)

Co-authored-by: satanson <ranpanf@gmail.com>
2022-06-25 17:29:30 +08:00
Seaven d8c635c1b2 [BugFix] CTE rewrite count distinct override mv rewrite (#7702) 2022-06-25 16:52:36 +08:00
Youngwb 2d4dc463b3 [BugFix] Fix count distinct constant unknown error when use cte (#7646) 2022-06-25 16:52:36 +08:00
Youngwb 8d4dfd06cd [Enhancement] Optimize avg(distinct column) with CTE (#7264) 2022-06-25 16:52:36 +08:00
Seaven 66ff481666 [BugFix] Fix cte inline bug (#7662) 2022-06-25 16:52:36 +08:00
gengjun-git 648bbc518f
[Cherry-Pick-2.3][BugFix] Fix load jobs hang with error: current running txns on db xxx is 100, larger than limit 100 (#7569)
After transferring the master, the master address recorded in BE is still the address of the old master (until the new master's heartbeat reaches it). A txnCommit RPC executed on a non-master FE will cause metadata inconsistency issues (described in #7350), so we should reject such requests if the current node is not the master.
2022-06-25 12:50:57 +08:00
zhangqiang 01712c720b [BugFix]: compaction error when enable persistent index (#7654)
In PersistentIndex::try_replace(), the new value is not updated when the same key is found in the hash map, which causes data errors during compaction of the primary key table.
2022-06-25 09:44:05 +08:00
sduzh f2e71258df [BugFix] Fixed invalid init of ExecutionQueueId and std::mutex in bthread
Conflicts:
	be/src/runtime/load_channel.cpp
2022-06-25 09:28:54 +08:00
rickif d0840cc858 [BugFix] wrong result of load with condition `IS NULL`/`IS NOT NULL` (#7748)
Fix result of load with condition IS NULL/IS NOT NULL.

(cherry picked from commit e7abc9ed8e)
2022-06-25 09:26:39 +08:00
Murphy 46845282ad
[Feature] support extract fields from parquet struct (backport #7655) (#7810) 2022-06-25 09:20:12 +08:00
Seaven af27b0c7dc
[Enhancement] Optimizer plan timeout (backport #7542) (#7759) 2022-06-24 23:58:18 +08:00
mergify[bot] dfd7f38629
[BugFix] Fix alter routine load NPE (backport #7805) 2022-06-24 23:56:13 +08:00
Napoleon 9887c2389a [Enhancement] Support column groups for final merge (#7333)
Merging temporary segment files in the final merge procedure takes a lot of memory, which may cause OOM. Support column groups when merging segment files.
There is some repetitive code, but refactoring now would affect unrelated code, so we leave it and will refactor in a separate follow-up PR. This reduces memory usage by 13~16x and latency by 35% in a case like this one (5000 columns, 4.3 GB, 100000 rows); other data scales show almost the same ratio.
2022-06-24 22:08:53 +08:00
mergify[bot] 05e60c2c83
[Feature] Add unknown_catalog_and_db ERRORCODE (#7797) (#7800)
error catalog in query -> ERR_BAD_CATALOG_ERROR
error catalog.db in use statement -> ERR_BAD_CATALOG_AND_DB_ERROR
2022-06-24 21:37:39 +08:00
mergify[bot] 5bff38c262
[BugFix] Fix broker exception the fd is not owned by client null (#7773) (#7781)
Broker has a control-plane heartbeat service. It may time out when pressure is high; the client is then removed from clientContexts, making the data plane throw an "FD is not owned by client null" exception.
So we update the heartbeat timestamp in both the control plane and the data plane to avoid this problem.

(cherry picked from commit 6c56ddc508)

Co-authored-by: meegoo <hujie-dlut@qq.com>
2022-06-24 19:45:19 +08:00
trueeyu 98da56f1e6 [Enhancement] move the destructor of TabletsChannel out of lock (#7753)
We call bthread::execution_queue_join() in the destructor of AsyncDeltaWriter; this function blocks the bthread, so we move the destruction of TabletsChannel outside the lock.
2022-06-24 19:12:19 +08:00
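A minimal sketch of the pattern, with stand-in types: move the last reference out of the critical section so the blocking destructor runs after the lock is released.

```
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

struct TabletsChannel {
    // May call bthread::execution_queue_join() transitively and block.
    ~TabletsChannel() {}
};

class LoadChannel {
public:
    void remove_channel(int64_t index_id) {
        std::shared_ptr<TabletsChannel> doomed;  // outlives the lock scope
        {
            std::lock_guard<std::mutex> guard(_lock);
            auto it = _channels.find(index_id);
            if (it == _channels.end()) return;
            doomed = std::move(it->second);
            _channels.erase(it);
        }
        // `doomed` is destroyed here, outside the lock, so the blocking
        // destructor cannot stall other threads waiting on _lock.
    }

private:
    std::mutex _lock;
    std::unordered_map<int64_t, std::shared_ptr<TabletsChannel>> _channels;
};
```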
trueeyu 05ce6b0738
[BugFix] fix the bug of get_tablet with schema hash (#7736) (#7746)
Our usage converts `schema hash` to `include_deleted`, which was introduced in commit 3fd83add0a.
2022-06-24 16:34:35 +08:00
Binglin Chang 332a886afa
[Bugfix] Check if tablet deleted after acquiring tabletupdates' lock (#7304) (#7731)
When deleting a primary key tablet, clear_meta will be called and _edit_version_infos will be cleared; this makes all other concurrent operations invalid, so operations running in other threads should check state validity after acquiring the lock.
2022-06-24 11:00:03 +08:00
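A minimal sketch of the check-after-lock rule described above, with stand-in names for the tablet state:

```
#include <mutex>
#include <vector>

struct EditVersionInfo {};

class TabletUpdates {
public:
    bool get_latest_version(EditVersionInfo* out) {
        std::lock_guard<std::mutex> guard(_lock);
        // clear_meta() may have run between the caller's decision to read
        // and this lock acquisition, so validity must be re-checked here.
        if (_edit_version_infos.empty()) {
            return false;  // tablet was deleted concurrently
        }
        *out = _edit_version_infos.back();
        return true;
    }

private:
    std::mutex _lock;
    std::vector<EditVersionInfo> _edit_version_infos;
};
```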
mergify[bot] 5aa2f0ac00
[BugFix] use new parser in meta replay & restore (backport #7481) (#7700) 2022-06-24 09:53:32 +08:00
mergify[bot] 90a1b25ed8
[Bugfix] Throw exception when failed to get databases (#7710) (#7728)
(cherry picked from commit 747af2b726)

Co-authored-by: stephen <91597003+stephen-shelby@users.noreply.github.com>
2022-06-23 23:37:56 +08:00
mergify[bot] a18d59d70c
[refactor] Adjust drop external catalog syntax (#7711) (#7723)
drop external catalog catalog_name -> drop catalog catalog_name

(cherry picked from commit ebf2110d17)

Co-authored-by: stephen <91597003+stephen-shelby@users.noreply.github.com>
2022-06-23 22:33:48 +08:00
mergify[bot] 63c1346a0c
[Bugfix] fix init database failed in QueryDetail when executing multiple statements (#7715)
When we send multiple external-catalog queries via JDBC, initializing the database in QueryDetail fails.
2022-06-23 20:36:43 +08:00
mergify[bot] 41426cd2fe
[BugFix] make hdfs scan use individual WorkgroupOwner (backport #6864) (#7699) 2022-06-23 17:06:47 +08:00
stdpain a2f3714225 [Bugfix] keep the aggregate expr order when rewrite aggregate operator (#7657)
We should keep the aggregate expr order during rewrite, because we use
singleDistinctFunctionPos to decide whether to call update or merge.

(cherry picked from commit 3fadd41275)
2022-06-23 15:59:46 +08:00
HangyuanLiu d2306ea38a [BugFix] Fix parse comment incompatible bug (#7641) 2022-06-23 15:05:41 +08:00
HangyuanLiu 8c456a89a4 [BugFix] Fix parse comment incompatible bug (#7508) 2022-06-23 15:05:41 +08:00
HangyuanLiu cbcd34cc50 [BugFix] Fix set var select hint parse error bug #7540 2022-06-23 15:05:41 +08:00
zhuxt2015 593aca2d66 [Feature] Support create table in new Parser and new Analyzer (#6102)
2022-06-23 15:05:41 +08:00
mergify[bot] 26f88408c0
[BugFix] trim quote in get_json_string (backport #7472) (#7670) 2022-06-23 11:08:37 +08:00
mergify[bot] 5e2764485f
[BugFix] fix nullable column has_null check (backport #7617) (#7668) 2022-06-23 09:50:14 +08:00
mergify[bot] 7c02ffe078
[Bugfix] fix BE crash when destroy pass-through buffer (backport #7623) (#7631) 2022-06-22 23:41:07 +08:00
mergify[bot] 41f0e3f69f
[BugFix] allow set an empty resource_group through variable (backport #7633) (#7663) 2022-06-22 22:53:46 +08:00
Binglin Chang 2eba1449e2 [Enhancement] Check total row count consistency after compaction (#7287)
Check total row count consistency after compaction, so bugs can be detected early.
2022-06-22 19:04:37 +08:00
stephen 0fe06312b3
[Feature] UseStmt support catalog.database (#7585) (#7636)
'use xxx' from the mysql client is handled by the COM_INIT_DB protocol in FE and doesn't generate a UseStmt.
'use xxx' from JDBC is handled by COM_QUERY, so we need to adapt UseStmt to support 'use catalog.db'.
This PR ports UseStmt from the old parser and analyzer to the new parser and analyzer.
2022-06-22 17:22:02 +08:00
mergify[bot] 01b9054325
[Feature] Support any_value function for JSON type (backport #7560) (#7598) 2022-06-22 11:24:36 +08:00
zhangqiang 0960b1733d BugFix: data error in build persistent index from tablet (#7555) 2022-06-22 09:49:34 +08:00
zhangqiang 822bfe30bb [BugFix] Potential data errors in primary tables using persistent indexes (#7417)
In PrimaryIndex::_insert_into_persistent_index(), if the values in rowids are not contiguous, we get wrong values after calling PrimaryIndex::_build_persistent_values, which may cause data errors in primary key tables using a persistent index.
2022-06-22 09:49:34 +08:00
trueeyu 65c4f37b10 Fix the bug of create inital rowset (#7553) 2022-06-22 09:45:34 +08:00
mergify[bot] 48df20f7f7
[BugFix] rename JoinRuntimeFilterTime to ConjunctsTime (#7537) (#7579)
(cherry picked from commit 2dedd98515)

Co-authored-by: liuyehcf <1559500551@qq.com>
2022-06-21 21:15:08 +08:00
mergify[bot] 6255dff9ff
[Bugfix] Fix bug task process database (#7546) (#7591)
1. ConnectContext should use the database set by the task.
2. Should not fetch the database with getDb, because the db may have been dropped.
3. Should not expose the default cluster to the user; only show the db name.

(cherry picked from commit 43695ebcb6)

Co-authored-by: xueyan.li <astralidea@163.com>
2022-06-21 21:11:35 +08:00
mergify[bot] 0a81367a25
Routine load failed when fe restart (#7567) (#7587)
(cherry picked from commit 2c4f754c55)

Co-authored-by: qinmengna <86873587+goodqiang@users.noreply.github.com>
2022-06-21 20:33:19 +08:00
mergify[bot] 4e83cb1fb1
[BugFix] Fix the wrong type check of iceberg/hudi table (#7504) (#7531)
The BE will crash if an integer field of an iceberg/hudi table is defined as tinyint/smallint in the StarRocks external table, when the compile type is DEBUG or ASAN.

(cherry picked from commit 887832b381)

Co-authored-by: miomiocat <284487410@qq.com>
2022-06-21 17:46:52 +08:00
trueeyu 32daac249e [Enhancement] Don't output the unused log of RE2 (#7528)
```
      if (!prog_->SearchDFA(subtext, text, anchor, kind,
                            matchp, &dfa_failed, NULL)) {
        if (dfa_failed) {
          if (options_.log_errors())
            LOG(ERROR) << "DFA out of memory: size " << prog_->size() << ", "
                       << "bytemap range " << prog_->bytemap_range() << ", "
                       << "list count " << prog_->list_count();
          // Fall back to NFA below.
          skipped_test = true;
          break;
        }
        return false;
```

If the search fails with the `DFA`, RE2 falls back to the `NFA` algorithm, so this log is useless for us.
2022-06-21 17:37:20 +08:00
dorianzheng 891d5c49e5
[BugFix] Fix wrong column order (#7482) (#7554)
Since DataStreamRecvr::SenderQueue::_build_chunk_meta is only called for the first chunk received, subsequent chunks must keep the same column order as the first chunk; otherwise deserialization might get the wrong type from the chunk meta.
2022-06-21 16:41:27 +08:00
Napoleon 7e44ad87ca
[BugFix] Ignore update when no load data(#7430) (#7467) 2022-06-21 16:05:54 +08:00
mergify[bot] abcb0baf25
Revert "[BugFix]Fix wrong column order (#7413)" (#7440) (#7539)
(cherry picked from commit a50e52d7ce)

Co-authored-by: kangkaisen <kangkaisen@apache.org>
2022-06-21 15:38:21 +08:00
HangyuanLiu e0d20cf4dc [BugFix] Fix FieldReference equals has NPE bug (#7194)
[BugFix] Fix FieldReference equals has NPE bug (#7194)
2022-06-21 14:42:47 +08:00
HangyuanLiu 08417aa3c5 Modify sql_mode default value to only_full_group_by in version 2.3 (#7084) 2022-06-21 14:42:47 +08:00
HangyuanLiu 5124944c60 Fix bug getAssigmentCompatibleTypeOfDecimalV3 miss time type (#7124) 2022-06-21 14:42:47 +08:00
HangyuanLiu ede727db66 Fix sql_full_groupby mode mistake rewrite column in aggregation function bug (#7049) 2022-06-21 14:42:47 +08:00
HangyuanLiu a0953287e5 [Enhancement] Support parens relation syntax in new Parser #6894 2022-06-21 14:42:47 +08:00
HangyuanLiu 7ab4d8ab9a Remove `default_cluster` from ViewDefBuilder (#6907) 2022-06-21 14:42:47 +08:00
mergify[bot] 28ed52dd0e
[Bugfix] Fix bug defineExpr is not set use old Analyzer (#7403) (#7535)
This is because #6459 removed the old analyze method, but defineExpr needs the old Analyzer to be generated.

(cherry picked from commit 0e13d2ea0d)

Co-authored-by: xueyan.li <astralidea@163.com>
2022-06-21 14:13:23 +08:00
zhangqiang 9ad21f1318 BugFix: Fe restart failed (#7433) 2022-06-21 12:47:38 +08:00
zhangqiang d26a98cccd [BugFix]: Load txn commit failed while primary table using persistent index (#7450)
If a primary key table uses a persistent index, we create an `index.l0.x.x` file to save the snapshot or log. However, after the first data load succeeds, the second data load may fail.

The reason is that when we dump a snapshot of the in-memory `hash_map`, we use `dump_bound()` to estimate the snapshot file size. But when the map is empty, `dump_bound()` returns a value larger than sizeof(uint64_t), while the snapshot file only writes `size_ (uint64_t)`. This corrupts the metadata, which makes deserializing `index.l0.x.x` fail.

So the return value of `dump_bound()` is set to `sizeof(size_t)` when `_map` is empty.
2022-06-21 12:47:38 +08:00
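A minimal sketch of the sizing rule, assuming the snapshot layout described above (a size_ header followed by packed kv pairs); the map type is a stand-in for the real l0 structure:

```
#include <cstddef>
#include <cstdint>
#include <unordered_map>

using KvMap = std::unordered_map<uint64_t, uint64_t>;  // stand-in for l0

// Upper bound of the snapshot file size. The empty map writes only its
// size header, so the bound must match exactly; over-estimating here is
// what corrupted the index.l0.x.x metadata.
std::size_t dump_bound(const KvMap& map) {
    if (map.empty()) {
        return sizeof(std::size_t);  // the fix: only the size header
    }
    return sizeof(std::size_t) + map.size() * (2 * sizeof(uint64_t));
}
```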
xueyan.li 3c8cd6e813
[Cherry-pick] Optimize task observability and usability (#7514)
Add an expire_time field for task/taskrun to improve observability.
Add the config task_check_interval_second instead of label_clean_interval_second.
Fix a bug in the queue size.
2022-06-21 09:46:46 +08:00
mergify[bot] 59d25c9061
[BugFix] Malformed packet error when Result Package greater than 16M (backport #6843) (#7239) 2022-06-20 22:52:36 +08:00
Youngwb 0b93919d59
[BugFix] Remove wrong preconditions in join reorder (#7099) (#7497) 2022-06-20 18:37:43 +08:00
xyz f2708fae49 [BugFix] choose wrong compaction algorithm in some case (#7225)
When performing base compaction on the tablet with the following rowsets:

   "rowsets": [
        "[0-175183] 11 DATA NONOVERLAPPING",
        "[175184-175232] 0 DATA NONOVERLAPPING",
        "[175233-175278] 0 DATA NONOVERLAPPING",
        "[175279-175327] 0 DATA NONOVERLAPPING",
        "[175328-175369] 0 DATA NONOVERLAPPING"
    ]

Because there are 11 segments and the number of columns is more than 5, it chooses vertical compaction.
When creating a tablet reader, it merges the 11 segment iterators into one union iterator and creates a HeapMaskIterator with only that one union_iterator.
In `new_heap_merge_iterator`, the iterator is returned immediately if the number of input iterators is 1.
So we then call get_next(Chunk* chunk, std::vector<RowSourceMask>* source_masks) on the union_iterator.
However, the union_iterator doesn't implement get_next(Chunk* chunk, std::vector<RowSourceMask>* source_masks), and
we get the warning `get chunk with sources not supported`.

In fact, we should count the number of segment iterators after creating the tablet reader (only one union iterator in this case)
and choose horizontal compaction.

(cherry picked from commit cc75587f46)
2022-06-20 13:30:45 +08:00
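A sketch of the corrected decision, with hypothetical names and the thresholds quoted in the entry (vertical compaction needs per-source row masks, which a single union iterator cannot produce):

```
#include <cstddef>

enum class CompactionAlgorithm { kHorizontal, kVertical };

// Decide after the tablet reader is built, from the number of iterators
// that actually feed the merge, not from the raw segment count.
CompactionAlgorithm choose_algorithm(std::size_t num_input_iterators,
                                     std::size_t num_columns) {
    if (num_input_iterators <= 1) {
        // A lone union iterator cannot emit RowSourceMasks.
        return CompactionAlgorithm::kHorizontal;
    }
    return num_columns > 5 ? CompactionAlgorithm::kVertical
                           : CompactionAlgorithm::kHorizontal;
}
```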
mergify[bot] 77031a90dd
Global rf should not take the place of local rf when all the runtime-filters' selectivity is >0.01 and <0.5 (#7420) (#7470)
(cherry picked from commit 17b7227017)

Co-authored-by: satanson <ranpanf@gmail.com>
2022-06-20 13:24:43 +08:00
mergify[bot] b202d93601
[BugFix]Fix wrong column order (#7413) (#7436)
Since DataStreamRecvr::SenderQueue::_build_chunk_meta is only called for the first chunk received, subsequent chunks must keep the same column order as the first chunk; otherwise deserialization might get the wrong type from the chunk meta.
2022-06-17 23:15:21 +08:00
mergify[bot] 61c7817da9
[BugFix] fix window_funnel udf lost condition (#6812) (#7400)
(cherry picked from commit e14f411523)

Co-authored-by: hongli.chang <honglichang@tencent.com>
2022-06-17 22:36:25 +08:00
mergify[bot] dba5aa6d18
[BugFix] Fix memory leak when ReusableClosure delete itself (#7138) (#7150)
The ReusableClosure destructor calls brpc::Join(), which can deadlock and block the bthread, resulting in a memory leak.

(cherry picked from commit 1adfb16a87)

Co-authored-by: meegoo <hujie-dlut@qq.com>
2022-06-17 21:27:30 +08:00
mergify[bot] 62e7496bab
Fix bug insert failed when query empty (#7376) (#7429)
insert should not report an error in this scenario

(cherry picked from commit 4fdb62ee6c)

Co-authored-by: xueyan.li <astralidea@163.com>
2022-06-17 20:02:56 +08:00
mergify[bot] 1ca1830f77
[Enhance] Consider key container memory usages for streaming aggregate (#7402) (#7411)
(cherry picked from commit 681e783a12)

Co-authored-by: zihe.liu <ziheliu1024@gmail.com>
2022-06-17 19:37:48 +08:00
mergify[bot] edaaaad694
Fix bug in some case failed to set DefineExpr for materialized view. (#7371) (#7423)
If the user's table contains uppercase columns, defineExpr may be processed in two ways when building a materialized view. One is the Log path, processed through MVColumn, which is fine. The other is the Image path, which compares directly against Column; this is problematic and causes defineExpr to be set to null, which affects both load and query.

(cherry picked from commit cd58001e60)

Co-authored-by: xueyan.li <astralidea@163.com>
2022-06-17 17:53:58 +08:00
sevev f3d194a481 [BugFix]: Use PrimaryIndex in memory when key columns contains variable length column 2022-06-17 16:56:18 +08:00
zhangqiang 3b9f7922db BugFix: Unanticipated errors from primary table with enable_persistent_index (#7342) 2022-06-17 16:56:18 +08:00
mergify[bot] 472fc67fcf
disable checking tablets in fe plan test (#7398) (#7406)
(cherry picked from commit 3652a2febb)

Co-authored-by: eyes_on_me <3675229+silverbullet233@users.noreply.github.com>
2022-06-17 15:38:16 +08:00
Murphy c78fc02adf
fix: skip fill_null_with_default for ConstColumn (#7319) (#7401)
(cherry picked from commit 8e3889eaeb)
2022-06-17 14:58:22 +08:00
mergify[bot] 3f64790c97
[BugFix] fix inconsistent http method since BE had used POST for transaction action (#6990) (#7373)
(cherry picked from commit 7c46402a34)

Co-authored-by: meegoo <hujie-dlut@qq.com>
2022-06-17 14:57:49 +08:00
mergify[bot] 8970dbc66b
[Bugfix] Avoid using released resources for ContextWithDependency after closing (backport #7363) (#7387) 2022-06-17 13:32:13 +08:00
zihe.liu 38174446b0
[Enhance] Estimate scan row bytes dynamically in time (backport #7202) (#7390) 2022-06-17 13:29:31 +08:00
Youngwb d30ba96e2b [BugFix] Fix count distinct multi columns with avg distinct plan (#7370)
(cherry picked from commit 814666dc63)
2022-06-17 12:59:28 +08:00
mergify[bot] 2e0880ca82
[BugFix] only concern primitive type when comparing ConstantOperator (#7277) (#7366)
(cherry picked from commit 3252034d37)

Co-authored-by: eyes_on_me <3675229+silverbullet233@users.noreply.github.com>
2022-06-17 12:27:11 +08:00
mergify[bot] 2d2a7ae33b
[BugFix] fix dop estimation for one-phase aggregation (backport #7278) (#7365) 2022-06-17 11:02:19 +08:00
Seaven 8ea310d002 [Bug] Fix in-subquery cast to outer join bug (#7090)
When the left table column is null and the result of the subquery is empty, the CASE WHEN hits null, not false.

```
MySQL w2> select t0_27.c_0_2, ((t0_27.c_0_2) IN ((SELECT t0_27.c_0_2 FROM t0 AS t0_27 WHERE (t1_28.c_1_0) IN ('1970-01-18 15:14:39') ) ) ) from t0 AS t0_27, t1 AS t1_28;
+--------------------+-------------------------------------------------------------------------------------+
| c_0_2              | c_0_2 IN (((SELECT c_0_2 FROM t0 AS t0_27 WHERE c_1_0 IN ('1970-01-18 15:14:39')))) |
+--------------------+-------------------------------------------------------------------------------------+
| 0.9736537942295025 | 0                                                                                   |
| 0.5452490003832424 | 0                                                                                   |
| 0.5452490003832424 | 0                                                                                   |
| <null>             | <null>                                                                              |
| <null>             | <null>                                                                              |
| 0.9736537942295025 | 0                                                                                   |
| 0.5452490003832424 | 0                                                                                   |
| 0.5452490003832424 | 0                                                                                   |
+--------------------+-------------------------------------------------------------------------------------+
8 rows in set
Time: 0.077s
```

It should be:
```
MySQL w2> select t0_27.c_0_2, ((t0_27.c_0_2) IN ((SELECT t0_27.c_0_2 FROM t0 AS t0_27 WHERE (t1_28.c_1_0) IN ('1970-01-18 15:14:39') ) ) ) from t0 AS t0_27, t1 AS t1_28;
+--------------------+-------------------------------------------------------------------------------------+
| c_0_2              | c_0_2 IN (((SELECT c_0_2 FROM t0 AS t0_27 WHERE c_1_0 IN ('1970-01-18 15:14:39')))) |
+--------------------+-------------------------------------------------------------------------------------+
| 0.9736537942295025 | 0                                                                                   |
| 0.5452490003832424 | 0                                                                                   |
| 0.5452490003832424 | 0                                                                                   |
| <null>             | 0                                                                                   |
| <null>             | 0                                                                                   |
| 0.9736537942295025 | 0                                                                                   |
| 0.5452490003832424 | 0                                                                                   |
| 0.5452490003832424 | 0                                                                                   |
+--------------------+-------------------------------------------------------------------------------------+
8 rows in set
Time: 0.077s

```
2022-06-17 10:47:58 +08:00
Seaven 5727e5a5fb [BugFix] Fix throw ConcurrentModificationException use partition range (#7076) 2022-06-17 10:47:58 +08:00
Seaven 8c4873aab8 [BugFix] Fix join reorder push project error (#6866) 2022-06-17 10:47:58 +08:00
Seaven 2a5f1434d9 [BugFix] CTEAnchor must return child property (#6806) 2022-06-17 10:47:58 +08:00
trueeyu 5b21ff9450 Fix the problem of estimate memtable size (#7316)
The size calculation cost is relatively large for some types of `Column` such as `BitmapColumn`, so the estimation of the `MemTable` size is changed to incremental calculation.

When importing, a `Chunk` may involve multiple tablets. If row lengths differ greatly between tablets, for example the average string length in `Tablet` 1 is 100 bytes while strings in `Tablet` 2 are short or null, the estimated memtable size was `chunk_bytes_usage += chunk.bytes_usage() * size / chunk.num_rows()`.

In extreme cases, due to incorrect estimation, the `MemTable` becomes too large or too small, or even exceeds 4 GB, resulting in crashes or data corruption.

The new strategy is to calculate incrementally, directly from the chunks of the `MemTable`.

Before the modification, the MemTable sizes:

```
...
FLUSH:16949250
FLUSH:16983890
FLUSH:17078640
FLUSH:17270930
FLUSH:17073270
FLUSH:16934340
FLUSH:16890280
FLUSH:17088330
FLUSH:17014780
FLUSH:17002530
FLUSH:17299380
FLUSH:7911050
FLUSH:2497327728
FLUSH:2488632580
FLUSH:3305326048
FLUSH:3309594000
FLUSH:3303867992
FLUSH:1495277328
FLUSH:4116982500
...
```

After the modification, the MemTable sizes:

```
...
FLUSH:104910526
FLUSH:104868988
FLUSH:104885820
FLUSH:104871402
FLUSH:104864160
FLUSH:104893658
FLUSH:104868340
FLUSH:104864715
FLUSH:104868988
FLUSH:104890714
FLUSH:104861746
FLUSH:104868356
FLUSH:104860294
FLUSH:104874798
FLUSH:104893900
FLUSH:104876775
FLUSH:104872573
FLUSH:104861746
FLUSH:104871402
FLUSH:104888300
FLUSH:104876720
...
```
2022-06-17 09:45:28 +08:00
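A minimal sketch of the two strategies, with stand-in types; the formula in the old path is the one quoted above:

```
#include <cstddef>

struct Chunk {
    std::size_t bytes = 0;
    std::size_t rows = 0;
    std::size_t bytes_usage() const { return bytes; }
    std::size_t num_rows() const { return rows; }
};

// Old strategy: extrapolate from the average row size of the whole chunk.
// Wildly wrong when row sizes differ between the tablets the chunk feeds.
std::size_t estimate_tablet_bytes(const Chunk& chunk, std::size_t rows_for_tablet) {
    return chunk.bytes_usage() * rows_for_tablet / chunk.num_rows();
}

// New strategy: accumulate the actual bytes of each chunk appended to the
// memtable, so the flush threshold tracks real memory usage.
class MemTable {
public:
    void append(const Chunk& chunk) { _bytes += chunk.bytes_usage(); }
    std::size_t bytes_usage() const { return _bytes; }

private:
    std::size_t _bytes = 0;
};
```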
mergify[bot] 51f0861834
canonicalize decimal types when desc table (#7307) (#7328)
decimal32(p,s)/decimal64(p,s)/decimal128(p,s) should show as decimal(p,s) in statements as follows:
```
mysql> desc t0;
+-------+--------------+------+-------+---------+-------+
| Field | Type         | Null | Key   | Default | Extra |
+-------+--------------+------+-------+---------+-------+
| k0    | INT          | No   | true  | NULL    |       |
| c0    | DECIMAL(9,2) | No   | false | NULL    |       |
| c1    | DECIMAL(9,2) | No   | false | NULL    |       |
| c2    | DECIMAL(9,2) | No   | false | NULL    |       |
+-------+--------------+------+-------+---------+-------+
4 rows in set (0.00 sec)

mysql> desc t0 all;
+-----------+---------------+-------+--------------+------+-------+---------+-------+
| IndexName | IndexKeysType | Field | Type         | Null | Key   | Default | Extra |
+-----------+---------------+-------+--------------+------+-------+---------+-------+
| t0        | DUP_KEYS      | k0    | INT          | No   | true  | NULL    |       |
|           |               | c0    | DECIMAL(9,2) | No   | false | NULL    | NONE  |
|           |               | c1    | DECIMAL(9,2) | No   | false | NULL    | NONE  |
|           |               | c2    | DECIMAL(9,2) | No   | false | NULL    | NONE  |
+-----------+---------------+-------+--------------+------+-------+---------+-------+
4 rows in set (0.00 sec)

mysql> show columns from t0;
+-------+--------------+------+------+---------+-------+
| Field | Type         | Null | Key  | Default | Extra |
+-------+--------------+------+------+---------+-------+
| k0    | int          | NO   | YES  | NULL    |       |
| c0    | decimal(9,2) | NO   | NO   | NULL    |       |
| c1    | decimal(9,2) | NO   | NO   | NULL    |       |
| c2    | decimal(9,2) | NO   | NO   | NULL    |       |
+-------+--------------+------+------+---------+-------+
4 rows in set (0.00 sec)
```

(cherry picked from commit 4fb5cf4247)

Co-authored-by: satanson <ranpanf@gmail.com>
2022-06-16 14:56:39 +08:00
rickif 04a3d799e3 [BugFix] `get_json_string` returns NULL (#7243)
This PR fixes the problem that get_json_string returns NULL when a JSON array is the input. The previous implementation of get_json_xx used simdjson directly, which was confusing and hard to maintain. The new implementation unifies get_json_xx with the functions of JsonValue and simplifies get_json_xx.
2022-06-16 14:18:57 +08:00
Seaven 033f7e5285 [BugFix] window push down runtime filter (#7206) 2022-06-16 10:03:18 +08:00
Murphy d6eb26f707 [BugFix] fix #7107: throw exception when creating unsupported query_type (#7237)
(cherry picked from commit 7b7e97672e)
2022-06-16 10:02:20 +08:00
mergify[bot] 4468338b0f
set HikariCP log level to ERROR (#7306) (#7311)
(cherry picked from commit 6f3a93f083)

Co-authored-by: eyes_on_me <3675229+silverbullet233@users.noreply.github.com>
2022-06-16 09:16:46 +08:00
mergify[bot] 3e6c0446b2
[Bugfix]fix BE crash in hash join when probe stage set_finishing before build stage (backport #7155) (#7182)
(cherry picked from commit dc96d16506)

Co-authored-by: stdpain <34912776+stdpain@users.noreply.github.com>
2022-06-15 23:36:06 +08:00
mergify[bot] 80d953aa1c
[BugFix] fix sorting of overflowed nullable column (#7231) (#7283) 2022-06-15 23:34:20 +08:00
mergify[bot] 258fb77296
[Bugfix] Avoid NPE for operator status method after closing (#7255) (#7291)
(cherry picked from commit d30ccbb7ef)

Co-authored-by: zihe.liu <ziheliu1024@gmail.com>
2022-06-15 23:32:21 +08:00
stdpain 0a04e897f7 [Enhancement] make log clearer when create Expr (#7133)
(cherry picked from commit 046b341755)
2022-06-15 19:43:22 +08:00
mergify[bot] 24d95208d5
[BugFix] Error when delete nothing (#7172) (#7267)
This PR fixes the problem that StarRocks returns the error `ERROR 1064 (HY000): all partitions have no load data` when a DELETE statement deletes nothing.
2022-06-15 14:53:14 +08:00
Murphy 0ad39232a4
[BugFix] fix memory statistic in local passthrough (#7183) (#7234)
(cherry picked from commit ba1b9acc06)
2022-06-15 13:41:35 +08:00
mergify[bot] f2d888d4d6
[BugFix] Fix Agg table use replace agg function when load_dop is not 1 will make data disorder (#7200) (#7221)
When an Aggregate table uses the REPLACE aggregate function and load_dop is not 1, rows can be applied out of order, leaving the data in a disordered state.
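A toy illustration (not StarRocks code) of why REPLACE is order-sensitive: the surviving value is whichever row is applied last, so splitting one load across parallel drivers can change the outcome.

```
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// REPLACE keeps the last value applied for each key.
std::unordered_map<int, std::string> apply_replace(
        const std::vector<std::pair<int, std::string>>& rows) {
    std::unordered_map<int, std::string> table;
    for (const auto& [key, value] : rows) {
        table[key] = value;  // last write wins
    }
    return table;
}

int main() {
    // The same two rows for key 1, delivered in two interleavings, as can
    // happen when load_dop > 1 splits the stream across drivers:
    auto a = apply_replace({{1, "v1"}, {1, "v2"}});
    auto b = apply_replace({{1, "v2"}, {1, "v1"}});
    std::cout << a[1] << " vs " << b[1] << "\n";  // prints "v2 vs v1"
}
```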

(cherry picked from commit efbdaff190)

Co-authored-by: meegoo <hujie-dlut@qq.com>
2022-06-15 12:20:52 +08:00
Youngwb 8a9d055286 Fix project node map error when group by constant (#7135)
(cherry picked from commit fff7bfc076)
2022-06-15 10:59:31 +08:00
mergify[bot] 93b7fe8da4
Use original sql for task definition (#7188) (#7250)
Using ViewDefBuilder or AST2SQL cannot convert the executed SQL correctly, and the user experience is poor.
A Task has its own db session, so it does not need cluster DB information or other redundant transformations; therefore we store only the raw SQL.

(cherry picked from commit 2a02375b6b)

Co-authored-by: xueyan.li <astralidea@163.com>
2022-06-15 09:51:42 +08:00
mergify[bot] aa71057c84
[BugFix] hive external table column order changes after refreshing (#7210) (#7226)
use a List to construct the schema instead of a Map
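A toy example of the underlying issue: an unordered map does not preserve insertion order, so rebuilding a schema from one can shuffle columns, while a list keeps the declared positions.

```
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    // Map-based schema: iteration order is unspecified, so columns can
    // come back in a different order after a refresh.
    std::unordered_map<std::string, std::string> by_name = {
            {"c1", "INT"}, {"c2", "STRING"}, {"c3", "DOUBLE"}};
    for (const auto& [name, type] : by_name) std::cout << name << " ";
    std::cout << "<- order unspecified\n";

    // List-based schema: positions are explicit and stable.
    std::vector<std::pair<std::string, std::string>> by_position = {
            {"c1", "INT"}, {"c2", "STRING"}, {"c3", "DOUBLE"}};
    for (const auto& [name, type] : by_position) std::cout << name << " ";
    std::cout << "<- declared order\n";
}
```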
2022-06-14 19:37:44 +08:00
stdpain 44131aa84c
[cherry-pick][feature] Support the use of cross-database use of UDF (backport #6865) (#7212) 2022-06-14 19:10:27 +08:00
kangkaisen d8d14a8f19 Fix create MaterializedView with new Analyzer bug (#6900) 2022-06-13 20:02:43 +08:00
mergify[bot] 8d39b3720d
[BugFix] showing unknown resource group should throw Exception (backport #6888) (#6980) 2022-06-13 15:28:40 +08:00
Murphy 575e1a0254
[Refactor] change database classifier of resource group to db (#6896) (#7148)
(cherry picked from commit dc6e0adbe4)
2022-06-13 11:27:31 +08:00
mergify[bot] 62a0933b22
[Enhance] try to release chunks of operator when close (backport #7086) (#7137) 2022-06-12 22:41:10 +08:00
mergify[bot] 216e50101e
[BugFix] estimate output row bytes of OlapScanNode error (backport #7141) (#7142) 2022-06-12 22:39:54 +08:00
mergify[bot] 6437cf6c58
[Enhance] change strategy of pipeline_dop (backport #6950) (#7143) 2022-06-12 13:22:19 +08:00
mergify[bot] faa6521e24
[Enhance] replace query context expired timeout with query_delivery_timeout (backport #7085) (#7145) 2022-06-12 13:20:03 +08:00
mergify[bot] 496155966c
[StarRocks On ES] fix wrong value error when reading field (#7103) (#7126)
(cherry picked from commit b9443bb41f)

Co-authored-by: Rowen <105833710+RowenWoo@users.noreply.github.com>
2022-06-11 16:39:59 +08:00
Youngwb 14097008d1 [BugFix] Fix delete best expression not deal with enforcer when merge group (#6981)
(cherry picked from commit b3e2a9ba82)
2022-06-11 16:31:51 +08:00
liuyehcf 77e5cc103b TopDownRewriteTask add trace log 2022-06-11 15:41:54 +08:00
liuyehcf e77b62dbb7 [Enhance] add PENDING_FINISH_TIME in profile (#7074) 2022-06-11 15:41:54 +08:00
liuyehcf 926912d842 [BugFix] Remove duplicate limit operator (#6987) 2022-06-11 15:41:54 +08:00
liuyehcf e890ff90dc [BugFix] Update error msg when apply analytic function on array type (#6961) 2022-06-11 15:41:54 +08:00
liuyehcf c105df67ad [BugFix] Fixed inaccurate of WaitTime estimation (#6938) 2022-06-11 15:41:54 +08:00
liuyehcf 3d343bb80e [BugFix] Fix index out of bounds exception of profile merge mechanism (#6789) 2022-06-11 15:41:54 +08:00
rickif bba404ce01 [BugFix] Support transferring json using HTTP chunk (#6972)
This PR is the last part of the fix for #5515, including the following modifications:
1. building a JSON buffer in the stream load handler to support a chunked JSON HTTP body (see the sketch after this list)
2. removing `Status MessageBodySink::append(const ByteBufferPtr& buf)` to reduce memory copies
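A minimal sketch of the buffering idea behind item 1, with illustrative names rather than the real stream-load classes:

```
#include <cstddef>
#include <string>

// A chunked HTTP body arrives in pieces, and JSON cannot be parsed until the
// document is complete, so the handler appends every piece into one buffer
// and parses only at end-of-body.
class JsonBodyBuffer {
public:
    void append(const char* data, size_t len) { _buf.append(data, len); }
    // Call once the final HTTP chunk has been received.
    const std::string& complete_document() const { return _buf; }

private:
    std::string _buf;
};
```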
2022-06-11 15:22:18 +08:00
gengjun-git 7a587dfcf7 [BugFix] Fix loss of metadata after alter routine load (#6937)
When a routine load serializes its load properties (column/row separators, column list, partitions, where filter), it stores the SQL statement and recovers the properties by parsing that SQL during deserialization. But after an alter routine load completes, only the alter statement is kept, which loses metadata. We should merge the alter SQL with the original create SQL to retain all load properties.
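An illustrative merge (not the FE implementation) of the described fix:

```
#include <map>
#include <string>

// Start from the properties parsed out of the original CREATE ROUTINE LOAD
// statement, then overlay only the properties the ALTER statement changed,
// so nothing from the original definition is lost on deserialization.
std::map<std::string, std::string> merge_load_properties(
        std::map<std::string, std::string> create_props,
        const std::map<std::string, std::string>& alter_props) {
    for (const auto& [key, value] : alter_props) {
        create_props[key] = value;  // ALTER overrides; everything else survives
    }
    return create_props;
}
```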
2022-06-11 15:17:36 +08:00
mergify[bot] 394a8acc2a
Fix the default encoding of largeint (#7113) (#7116)
(cherry picked from commit dc22c8d0b7)

Co-authored-by: Murphy <96611012+mofeiatwork@users.noreply.github.com>
2022-06-11 14:06:37 +08:00
mergify[bot] 4554982b2b
[BugFix] fix Java UDF crash (backport #7110) (#7112) 2022-06-11 13:19:33 +08:00
mergify[bot] ab4273b71d
update name and type when showing catalogs (#7093) (#7101)
(cherry picked from commit 44c41e9c75)

Co-authored-by: stephen <91597003+stephen-shelby@users.noreply.github.com>
2022-06-11 10:17:03 +08:00
stdpain 2f5e5c2387 [Bugfix] fix call JNI in bthread in Java UDF (#7092)
(cherry picked from commit 482f841409)
2022-06-10 22:35:11 +08:00
gengjun-git 63e3186123
[BugFix] fix the npe bug when replaying the tablet consistency check data (#6685) (#7067)
Only check OLAP tables that are in the NORMAL state. Some tablets of a non-NORMAL table may exist only temporarily in memory; if we check those tablets and log FinishConsistencyCheck to bdb, a NullPointerException is thrown when the log is replayed.
2022-06-10 20:46:10 +08:00
meegoo d8ce6334d0 [BugFix] Fix be crash when flush mem table is null (#7005)
(cherry picked from commit d323b2a246)
2022-06-10 16:46:04 +08:00
zihe.liu 3dbe447270
[Bugfix] process returned status correctly (#7040) (#7070) 2022-06-10 12:47:44 +08:00
mergify[bot] 3165526f9f
[Enhance] change max_transmit_batched_bytes from 64KB to 256KB (#6959) (#7029)
(cherry picked from commit cd84601267)

Co-authored-by: Murphy <96611012+mofeiatwork@users.noreply.github.com>
2022-06-09 13:22:49 +08:00
mergify[bot] 2d4aa27e19
Fix CTAS speculation replication_num is wrong (#6994) (#7031)
The old code apparently omitted this logic; a UT is added so the problem does not recur.

(cherry picked from commit 536b979fef)

Co-authored-by: xueyan.li <astralidea@163.com>
2022-06-09 12:57:29 +08:00
mergify[bot] 925929ad3d
[Refactor] remove JsonValue from datum (backport #6887) (#6921) 2022-06-09 10:02:15 +08:00
mergify[bot] 39f295cc9a
[Bugfix] fix NPE in sort (backport #6949) (#6964) 2022-06-09 10:01:20 +08:00
mergify[bot] 42a3ee01fb
[BugFix] Fix merge limit error when subquery has limit with offset (backport #6920) (#7010) 2022-06-09 09:37:25 +08:00
mergify[bot] 7a0d53dd66
[BugFix] Fix the bitmap index for bool (#7003) (#7023)
(cherry picked from commit 7dfda17667)

Co-authored-by: Murphy <96611012+mofeiatwork@users.noreply.github.com>
2022-06-08 22:58:49 +08:00
mergify[bot] f52504e251
[Enhancement] Optimize task query metadata when there are many dbs (#6893) (#7007)
We can first send an RPC to find which DBs contain any Task/TaskRun, and then issue the metadata query only for those DBs, reducing the number of per-DB RPCs.
Usually, users do not create many DBs and submit Tasks in many different DBs simultaneously.
It can be optimized further later by building an index of which DBs each Task touches in advance; at present, no deeper optimization is performed. A sketch of the idea follows.
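A minimal sketch of the batching idea, with hypothetical types:

```
#include <cstdint>
#include <unordered_set>
#include <vector>

struct Task {
    int64_t db_id;  // hypothetical: the database that owns this task
};

// Derive the set of databases that actually contain tasks first, then issue
// one metadata query per member of that (usually small) set instead of one
// RPC per database in the cluster.
std::unordered_set<int64_t> dbs_with_tasks(const std::vector<Task>& tasks) {
    std::unordered_set<int64_t> dbs;
    for (const auto& task : tasks) dbs.insert(task.db_id);
    return dbs;
}
```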

(cherry picked from commit 9e1272b79f)

Co-authored-by: xueyan.li <astralidea@163.com>
2022-06-08 20:21:39 +08:00
mergify[bot] db473aea53
[BugFix] Fix npe when executing queryDictSync (#6968) (#6995)
(cherry picked from commit ff3d5230e2)

Co-authored-by: stephen <91597003+stephen-shelby@users.noreply.github.com>
2022-06-08 18:36:35 +08:00
mergify[bot] 908acd948d
Fix UT RoutineLoadManagerTest.testExpired 2 (#6908) (#6983)
This UT failed because of bad timing in the setup: the time was changed after the metadata was saved, and there is logic that re-reads the metadata, at which point the previously set time was no longer in effect, so the UT always failed. The fix is simply to set the time before saving the element.

(cherry picked from commit 269dede54f)

Co-authored-by: xueyan.li <astralidea@163.com>
2022-06-08 15:23:39 +08:00
mergify[bot] a0e51a8581
[Enhancement][Log] Add pending to running state change log for task (#6831) (#6977)
The RUNNING state is missing when querying a task's execution state from INFORMATION_SCHEMA through a FOLLOWER, so add it.
2022-06-08 14:33:02 +08:00
mergify[bot] b7c168058b
[Feature] Add id field to Catalog (#6933) (#6948)
use catalog_id as a reserved field to later support operations such as rename
2022-06-07 18:42:27 +08:00
waittting 6d7e275f9f
Revert "[Feature] Make StarRocks support FQDN (#5127)" (#6945) (#6946)
This reverts commit aed944f3fb.
2022-06-07 15:18:34 +08:00
mergify[bot] c297e1d266
[Bugfix]support query external_catalog table when current_catalog is default_catalog (#6903) (#6917)
TableName#toSql() adds the catalog field.
AstBuilder#visitColumnReference() adapts to column names that have four levels.

(cherry picked from commit 0e2b8a9faf)

Co-authored-by: stephen <91597003+stephen-shelby@users.noreply.github.com>
2022-06-07 11:22:59 +08:00
stephen 01ee573f5f
[cherry-pick] Get hive column statistics downgrade policy (#4409) (#6899) (#6926) 2022-06-07 10:55:07 +08:00
1012 changed files with 48684 additions and 6342 deletions

18
.github/workflows/add-pr-label.yml vendored Normal file
View File

@ -0,0 +1,18 @@
name: Labels
on:
pull_request_target:
types:
- opened
paths:
- 'docs/**'
jobs:
pr-label:
runs-on: ubuntu-latest
steps:
- name: add document label
uses: actions-ecosystem/action-add-labels@v1
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
labels: documentation

54
.github/workflows/sonar4fe.yml vendored Normal file
View File

@ -0,0 +1,54 @@
name: FE Sonar Build
on:
push:
branches:
- branch-2.3
pull_request:
paths:
- 'fe/**.java'
- 'fe/**.xml'
jobs:
build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
with:
fetch-depth: 0 # Shallow clones should be disabled for a better relevancy of analysis
- name: Set up JDK 11
uses: actions/setup-java@v3
with:
java-version: 11
distribution: 'adopt'
- name: Cache SonarCloud packages
uses: actions/cache@v3
with:
path: ~/.sonar/cache
key: ${{ runner.os }}-sonar
restore-keys: ${{ runner.os }}-sonar
- name: Cache Maven packages
uses: actions/cache@v3
with:
path: ~/.m2
key: ${{ runner.os }}-maven-${{ hashFiles('**/pom.xml') }}
restore-keys: ${{ runner.os }}-maven
- name: Setup thrift
uses: dodopizza/setup-thrift@v1
with:
version: 0.13.0
- name: Analyze FE
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} # Needed to get PR information, if any
SONAR_TOKEN: 391d6539e2d09aed3d187353bfd85fefa7a4c281 # ${{ secrets.SONAR_TOKEN }}
run: |
thrift --version
whereis thrift
mkdir -p thirdparty/installed/bin/
cd thirdparty/installed/bin/ && ln -s /usr/local/bin/thrift thrift
cd ${{ github.workspace }}/fe
mvn -B -DskipTests verify org.sonarsource.scanner.maven:sonar-maven-plugin:sonar -Dsonar.projectKey=StarRocks_starrocks -Dsonar.pullrequest.key=${{ github.event.number }} -Dsonar.pullrequest.base=${{ github.base_ref }} -Dsonar.pullrequest.branch=${{ github.head_ref }}

View File

@ -740,6 +740,7 @@ install(FILES
${BASE_DIR}/../conf/be.conf
${BASE_DIR}/../conf/cn.conf
${BASE_DIR}/../conf/hadoop_env.sh
${BASE_DIR}/../conf/log4j.properties
DESTINATION ${OUTPUT_DIR}/conf)
install(DIRECTORY

View File

@ -34,7 +34,6 @@
#include "storage/storage_engine.h"
#include "storage/utils.h"
#include "util/debug_util.h"
#include "util/network_util.h"
#include "util/thrift_server.h"
using std::fstream;
@ -80,52 +79,11 @@ Status HeartbeatServer::_heartbeat(const TMasterInfo& master_info) {
if (master_info.__isset.backend_ip) {
if (master_info.backend_ip != BackendOptions::get_localhost()) {
LOG(INFO) << master_info.backend_ip << " not equal to to backend localhost "
<< BackendOptions::get_localhost();
if (is_valid_ip(master_info.backend_ip)) {
LOG(WARNING) << "backend ip saved in master does not equal to backend local ip"
<< master_info.backend_ip << " vs. " << BackendOptions::get_localhost();
std::stringstream ss;
ss << "actual backend local ip: " << BackendOptions::get_localhost();
return Status::InternalError(ss.str());
}
std::string ip = hostname_to_ip(master_info.backend_ip);
if (ip.empty()) {
std::stringstream ss;
ss << "can not get ip from fqdn: " << master_info.backend_ip;
LOG(WARNING) << ss.str();
return Status::InternalError(ss.str());
}
std::vector<InetAddress> hosts;
Status status = get_hosts_v4(&hosts);
if (!status.ok() || hosts.empty()) {
std::stringstream ss;
ss << "the status was not ok when get_hosts_v4, error is " << status.get_error_msg();
LOG(WARNING) << ss.str();
return Status::InternalError(ss.str());
}
bool set_new_localhost = false;
for (std::vector<InetAddress>::iterator addr_it = hosts.begin(); addr_it != hosts.end(); ++addr_it) {
if (addr_it->is_address_v4() && addr_it->get_host_address_v4() == ip) {
BackendOptions::set_localhost(master_info.backend_ip);
set_new_localhost = true;
break;
}
}
if (!set_new_localhost) {
std::stringstream ss;
ss << "the host recorded in master is " << master_info.backend_ip
<< ", but we cannot found the local ip that mapped to that host." << BackendOptions::get_localhost();
LOG(WARNING) << ss.str();
return Status::InternalError(ss.str());
}
LOG(INFO) << "update localhost done, the new localhost is " << BackendOptions::get_localhost();
LOG(WARNING) << "backend ip saved in master does not equal to backend local ip" << master_info.backend_ip
<< " vs. " << BackendOptions::get_localhost();
std::stringstream ss;
ss << "actual backend local ip: " << BackendOptions::get_localhost();
return Status::InternalError(ss.str());
}
}

View File

@ -390,7 +390,7 @@ void* TaskWorkerPool::_create_tablet_worker_thread_callback(void* arg_this) {
TFinishTaskRequest finish_task_request;
finish_task_request.__set_finish_tablet_infos(finish_tablet_infos);
finish_task_request.__set_backend(BackendOptions::get_localBackend());
finish_task_request.__set_backend(worker_pool_this->_backend);
finish_task_request.__set_report_version(_s_report_version);
finish_task_request.__set_task_type(agent_task_req.task_type);
finish_task_request.__set_signature(agent_task_req.signature);
@ -445,7 +445,7 @@ void* TaskWorkerPool::_drop_tablet_worker_thread_callback(void* arg_this) {
task_status.__set_error_msgs(error_msgs);
TFinishTaskRequest finish_task_request;
finish_task_request.__set_backend(BackendOptions::get_localBackend());
finish_task_request.__set_backend(worker_pool_this->_backend);
finish_task_request.__set_task_type(agent_task_req.task_type);
finish_task_request.__set_signature(agent_task_req.signature);
finish_task_request.__set_task_status(task_status);
@ -547,7 +547,7 @@ void TaskWorkerPool::_alter_tablet(TaskWorkerPool* worker_pool_this, const TAgen
}
// Return result to fe
finish_task_request->__set_backend(BackendOptions::get_localBackend());
finish_task_request->__set_backend(_backend);
finish_task_request->__set_report_version(_s_report_version);
finish_task_request->__set_task_type(task_type);
finish_task_request->__set_signature(signature);
@ -684,7 +684,7 @@ void* TaskWorkerPool::_push_worker_thread_callback(void* arg_this) {
TStatus task_status;
TFinishTaskRequest finish_task_request;
finish_task_request.__set_backend(BackendOptions::get_localBackend());
finish_task_request.__set_backend(worker_pool_this->_backend);
finish_task_request.__set_task_type(agent_task_req.task_type);
finish_task_request.__set_signature(agent_task_req.signature);
if (push_req.push_type == TPushType::DELETE) {
@ -901,7 +901,7 @@ void* TaskWorkerPool::_publish_version_worker_thread_callback(void* arg_this) {
}
status.to_thrift(&finish_task_request.task_status);
finish_task_request.__set_backend(BackendOptions::get_localBackend());
finish_task_request.__set_backend(worker_pool_this->_backend);
finish_task_request.__set_task_type(publish_version_task->task_type);
finish_task_request.__set_signature(publish_version_task->signature);
finish_task_request.__set_report_version(_s_report_version);
@ -990,7 +990,7 @@ void* TaskWorkerPool::_clear_transaction_task_worker_thread_callback(void* arg_t
TFinishTaskRequest finish_task_request;
finish_task_request.__set_task_status(task_status);
finish_task_request.__set_backend(BackendOptions::get_localBackend());
finish_task_request.__set_backend(worker_pool_this->_backend);
finish_task_request.__set_task_type(agent_task_req.task_type);
finish_task_request.__set_signature(agent_task_req.signature);
@ -1053,7 +1053,7 @@ void* TaskWorkerPool::_update_tablet_meta_worker_thread_callback(void* arg_this)
// because the primary index is available in cache
// But it will be remove from index cache after apply is finished
auto manager = StorageEngine::instance()->update_manager();
manager->index_cache().remove_by_key(tablet->tablet_id());
manager->index_cache().try_remove_by_key(tablet->tablet_id());
break;
}
}
@ -1067,7 +1067,7 @@ void* TaskWorkerPool::_update_tablet_meta_worker_thread_callback(void* arg_this)
TFinishTaskRequest finish_task_request;
finish_task_request.__set_task_status(task_status);
finish_task_request.__set_backend(BackendOptions::get_localBackend());
finish_task_request.__set_backend(worker_pool_this->_backend);
finish_task_request.__set_task_type(agent_task_req.task_type);
finish_task_request.__set_signature(agent_task_req.signature);
@ -1105,7 +1105,7 @@ void* TaskWorkerPool::_clone_worker_thread_callback(void* arg_this) {
// Return result to fe
TStatus task_status;
TFinishTaskRequest finish_task_request;
finish_task_request.__set_backend(BackendOptions::get_localBackend());
finish_task_request.__set_backend(worker_pool_this->_backend);
finish_task_request.__set_task_type(agent_task_req.task_type);
finish_task_request.__set_signature(agent_task_req.signature);
@ -1199,7 +1199,7 @@ void* TaskWorkerPool::_storage_medium_migrate_worker_thread_callback(void* arg_t
std::vector<std::string> error_msgs;
TStatus task_status;
TFinishTaskRequest finish_task_request;
finish_task_request.__set_backend(BackendOptions::get_localBackend());
finish_task_request.__set_backend(worker_pool_this->_backend);
finish_task_request.__set_task_type(agent_task_req.task_type);
finish_task_request.__set_signature(agent_task_req.signature);
@ -1322,7 +1322,7 @@ void* TaskWorkerPool::_check_consistency_worker_thread_callback(void* arg_this)
task_status.__set_error_msgs(error_msgs);
TFinishTaskRequest finish_task_request;
finish_task_request.__set_backend(BackendOptions::get_localBackend());
finish_task_request.__set_backend(worker_pool_this->_backend);
finish_task_request.__set_task_type(agent_task_req.task_type);
finish_task_request.__set_signature(agent_task_req.signature);
finish_task_request.__set_task_status(task_status);
@ -1339,6 +1339,7 @@ void* TaskWorkerPool::_report_task_worker_thread_callback(void* arg_this) {
TaskWorkerPool* worker_pool_this = (TaskWorkerPool*)arg_this;
TReportRequest request;
request.__set_backend(worker_pool_this->_backend);
while ((!worker_pool_this->_stopped)) {
if (worker_pool_this->_master_info.network_address.port == 0) {
@ -1356,7 +1357,6 @@ void* TaskWorkerPool::_report_task_worker_thread_callback(void* arg_this) {
}
}
request.__set_tasks(tasks);
request.__set_backend(BackendOptions::get_localBackend());
StarRocksMetrics::instance()->report_task_requests_total.increment(1);
TMasterResult result;
@ -1378,6 +1378,7 @@ void* TaskWorkerPool::_report_disk_state_worker_thread_callback(void* arg_this)
TaskWorkerPool* worker_pool_this = (TaskWorkerPool*)arg_this;
TReportRequest request;
request.__set_backend(worker_pool_this->_backend);
while ((!worker_pool_this->_stopped)) {
if (worker_pool_this->_master_info.network_address.port == 0) {
@ -1411,7 +1412,6 @@ void* TaskWorkerPool::_report_disk_state_worker_thread_callback(void* arg_this)
StarRocksMetrics::instance()->disks_state.set_metric(root_path_info.path, root_path_info.is_used ? 1L : 0L);
}
request.__set_disks(disks);
request.__set_backend(BackendOptions::get_localBackend());
StarRocksMetrics::instance()->report_disk_requests_total.increment(1);
TMasterResult result;
@ -1434,6 +1434,7 @@ void* TaskWorkerPool::_report_tablet_worker_thread_callback(void* arg_this) {
TaskWorkerPool* worker_pool_this = (TaskWorkerPool*)arg_this;
TReportRequest request;
request.__set_backend(worker_pool_this->_backend);
request.__isset.tablets = true;
AgentStatus status = STARROCKS_SUCCESS;
@ -1459,7 +1460,6 @@ void* TaskWorkerPool::_report_tablet_worker_thread_callback(void* arg_this) {
std::max(StarRocksMetrics::instance()->tablet_cumulative_max_compaction_score.value(),
StarRocksMetrics::instance()->tablet_base_max_compaction_score.value());
request.__set_tablet_max_compaction_score(max_compaction_score);
request.__set_backend(BackendOptions::get_localBackend());
TMasterResult result;
status = worker_pool_this->_master_client->report(request, &result);
@ -1482,6 +1482,7 @@ void* TaskWorkerPool::_report_workgroup_thread_callback(void* arg_this) {
TaskWorkerPool* worker_pool_this = (TaskWorkerPool*)arg_this;
TReportRequest request;
request.__set_backend(worker_pool_this->_backend);
AgentStatus status = STARROCKS_SUCCESS;
while ((!worker_pool_this->_stopped)) {
@ -1497,7 +1498,6 @@ void* TaskWorkerPool::_report_workgroup_thread_callback(void* arg_this) {
request.__set_report_version(_s_report_version);
auto workgroups = workgroup::WorkGroupManager::instance()->list_workgroups();
request.__set_active_workgroups(std::move(workgroups));
request.__set_backend(BackendOptions::get_localBackend());
TMasterResult result;
status = worker_pool_this->_master_client->report(request, &result);
@ -1555,7 +1555,7 @@ void* TaskWorkerPool::_upload_worker_thread_callback(void* arg_this) {
task_status.__set_error_msgs(error_msgs);
TFinishTaskRequest finish_task_request;
finish_task_request.__set_backend(BackendOptions::get_localBackend());
finish_task_request.__set_backend(worker_pool_this->_backend);
finish_task_request.__set_task_type(agent_task_req.task_type);
finish_task_request.__set_signature(agent_task_req.signature);
finish_task_request.__set_task_status(task_status);
@ -1611,7 +1611,7 @@ void* TaskWorkerPool::_download_worker_thread_callback(void* arg_this) {
task_status.__set_error_msgs(error_msgs);
TFinishTaskRequest finish_task_request;
finish_task_request.__set_backend(BackendOptions::get_localBackend());
finish_task_request.__set_backend(worker_pool_this->_backend);
finish_task_request.__set_task_type(agent_task_req.task_type);
finish_task_request.__set_signature(agent_task_req.signature);
finish_task_request.__set_task_status(task_status);
@ -1685,7 +1685,7 @@ void* TaskWorkerPool::_make_snapshot_thread_callback(void* arg_this) {
task_status.__set_error_msgs(error_msgs);
TFinishTaskRequest finish_task_request;
finish_task_request.__set_backend(BackendOptions::get_localBackend());
finish_task_request.__set_backend(worker_pool_this->_backend);
finish_task_request.__set_task_type(agent_task_req.task_type);
finish_task_request.__set_signature(agent_task_req.signature);
finish_task_request.__set_snapshot_path(snapshot_path);
@ -1739,7 +1739,7 @@ void* TaskWorkerPool::_release_snapshot_thread_callback(void* arg_this) {
task_status.__set_error_msgs(error_msgs);
TFinishTaskRequest finish_task_request;
finish_task_request.__set_backend(BackendOptions::get_localBackend());
finish_task_request.__set_backend(worker_pool_this->_backend);
finish_task_request.__set_task_type(agent_task_req.task_type);
finish_task_request.__set_signature(agent_task_req.signature);
finish_task_request.__set_task_status(task_status);
@ -1807,7 +1807,7 @@ void* TaskWorkerPool::_move_dir_thread_callback(void* arg_this) {
task_status.__set_error_msgs(error_msgs);
TFinishTaskRequest finish_task_request;
finish_task_request.__set_backend(BackendOptions::get_localBackend());
finish_task_request.__set_backend(worker_pool_this->_backend);
finish_task_request.__set_task_type(agent_task_req.task_type);
finish_task_request.__set_signature(agent_task_req.signature);
finish_task_request.__set_task_status(task_status);

View File

@ -45,7 +45,8 @@ uint8_t* ArrayColumn::mutable_raw_data() {
size_t ArrayColumn::byte_size(size_t from, size_t size) const {
DCHECK_LE(from + size, this->size()) << "Range error";
return _elements->byte_size(_offsets->get_data()[from], _offsets->get_data()[from + size]) +
return _elements->byte_size(_offsets->get_data()[from],
_offsets->get_data()[from + size] - _offsets->get_data()[from]) +
_offsets->Column::byte_size(from, size);
}
@ -130,6 +131,18 @@ void ArrayColumn::append_default(size_t count) {
_offsets->append_value_multiple_times(&offset, count);
}
void ArrayColumn::fill_default(const Filter& filter) {
std::vector<uint32_t> indexes;
for (size_t i = 0; i < filter.size(); i++) {
if (filter[i] == 1 && get_element_size(i) > 0) {
indexes.push_back(i);
}
}
auto default_column = clone_empty();
default_column->append_default(indexes.size());
update_rows(*default_column, indexes.data());
}
Status ArrayColumn::update_rows(const Column& src, const uint32_t* indexes) {
const auto& array_column = down_cast<const ArrayColumn&>(src);
@ -427,6 +440,11 @@ Datum ArrayColumn::get(size_t idx) const {
return Datum(res);
}
size_t ArrayColumn::get_element_size(size_t idx) const {
DCHECK_LT(idx + 1, _offsets->size());
return _offsets->get_data()[idx + 1] - _offsets->get_data()[idx];
}
bool ArrayColumn::set_null(size_t idx) {
return false;
}

View File

@ -80,6 +80,8 @@ public:
void append_default(size_t count) override;
void fill_default(const Filter& filter) override;
Status update_rows(const Column& src, const uint32_t* indexes) override;
void remove_first_n_values(size_t count) override {}
@ -119,6 +121,8 @@ public:
Datum get(size_t idx) const override;
size_t get_element_size(size_t idx) const;
bool set_null(size_t idx) override;
size_t memory_usage() const override { return _elements->memory_usage() + _offsets->memory_usage(); }

View File

@ -209,6 +209,23 @@ void BinaryColumnBase<T>::_build_slices() const {
_slices_cache = true;
}
template <typename T>
void BinaryColumnBase<T>::fill_default(const Filter& filter) {
std::vector<uint32_t> indexes;
for (size_t i = 0; i < filter.size(); i++) {
size_t len = _offsets[i + 1] - _offsets[i];
if (filter[i] == 1 && len > 0) {
indexes.push_back(i);
}
}
if (indexes.empty()) {
return;
}
auto default_column = clone_empty();
default_column->append_default(indexes.size());
update_rows(*default_column, indexes.data());
}
template <typename T>
Status BinaryColumnBase<T>::update_rows(const Column& src, const uint32_t* indexes) {
const auto& src_column = down_cast<const BinaryColumnBase<T>&>(src);

View File

@ -181,6 +181,8 @@ public:
_slices_cache = false;
}
void fill_default(const Filter& filter) override;
Status update_rows(const Column& src, const uint32_t* indexes) override;
uint32_t max_one_element_serialize_size() const override;
@ -242,6 +244,8 @@ public:
const Bytes& get_bytes() const { return _bytes; }
const uint8_t* continuous_data() const override { return reinterpret_cast<const uint8_t*>(_bytes.data()); }
Offsets& get_offset() { return _offsets; }
const Offsets& get_offset() const { return _offsets; }

View File

@ -85,6 +85,8 @@ public:
virtual uint8_t* mutable_raw_data() = 0;
virtual const uint8_t* continuous_data() const { return raw_data(); }
// Return number of values in column.
virtual size_t size() const = 0;
@ -147,6 +149,9 @@ public:
virtual void append(const Column& src) { append(src, 0, src.size()); }
// Update elements to default value which hit by the filter
virtual void fill_default(const Filter& filter) = 0;
// This function will update data from src according to the input indexes. 'indexes' contains
// the row index will be update
// For example:

View File

@ -35,6 +35,10 @@ void ConstColumn::append_value_multiple_times(const Column& src, uint32_t index,
append(src, index, size);
}
void ConstColumn::fill_default(const Filter& filter) {
CHECK(false) << "ConstColumn does not support update";
}
Status ConstColumn::update_rows(const Column& src, const uint32_t* indexes) {
return Status::NotSupported("ConstColumn does not support update");
}

View File

@ -116,6 +116,8 @@ public:
void append_default(size_t count) override { _size += count; }
void fill_default(const Filter& filter) override;
Status update_rows(const Column& src, const uint32_t* indexes) override;
uint32_t serialize(size_t idx, uint8_t* pos) override { return _data->serialize(0, pos); }

View File

@ -10,7 +10,6 @@
#include "types/date_value.hpp"
#include "types/timestamp_value.h"
#include "util/int96.h"
#include "util/json.h"
#include "util/percentile_value.h"
#include "util/slice.h"
@ -105,11 +104,6 @@ public:
}
}
template <typename T>
std::add_pointer_t<std::add_const_t<T>> get_if() const {
return std::get_if<std::remove_const_t<T>>(&_value);
}
template <typename T>
void set(T value) {
if constexpr (std::is_same_v<DateValue, T>) {
@ -125,21 +119,6 @@ public:
}
}
template <typename T>
void move_in(T&& value) {
if constexpr (std::is_same_v<DateValue, T>) {
_value = value.julian();
} else if constexpr (std::is_same_v<TimestampValue, T>) {
_value = value.timestamp();
} else if constexpr (std::is_same_v<bool, T>) {
_value = (int8_t)value;
} else if constexpr (std::is_unsigned_v<T>) {
_value = (std::make_signed_t<T>)value;
} else {
_value = std::move(value);
}
}
bool is_null() const { return _value.index() == 0; }
void set_null() { _value = std::monostate(); }
@ -150,15 +129,9 @@ public:
}
private:
// NOTE
// Either JsonValue and JsonValue* could stored in datum.
// - Pointer type JsonValue* is used as view-type, to navigate datum in a column without copy data
// - Value type JsonValue is used to store real data and own the value itself, which is mostly used to hold a
// JsonValue as return value. Right now only schema-change procedure use it.
using Variant =
std::variant<std::monostate, int8_t, uint8_t, int16_t, uint16_t, uint24_t, int32_t, uint32_t, int64_t,
uint64_t, int96_t, int128_t, Slice, decimal12_t, DecimalV2Value, float, double, DatumArray,
HyperLogLog*, BitmapValue*, PercentileValue*, JsonValue*, JsonValue>;
using Variant = std::variant<std::monostate, int8_t, uint8_t, int16_t, uint16_t, uint24_t, int32_t, uint32_t,
int64_t, uint64_t, int96_t, int128_t, Slice, decimal12_t, DecimalV2Value, float,
double, DatumArray, HyperLogLog*, BitmapValue*, PercentileValue*, JsonValue*>;
Variant _value;
};

View File

@ -10,6 +10,7 @@
#include "storage/decimal12.h"
#include "util/hash_util.hpp"
#include "util/mysql_row_buffer.h"
#include "util/value_generator.h"
namespace starrocks::vectorized {
@ -49,6 +50,16 @@ void FixedLengthColumnBase<T>::append_value_multiple_times(const Column& src, ui
}
}
template <typename T>
void FixedLengthColumnBase<T>::fill_default(const Filter& filter) {
T val = DefaultValueGenerator<T>::next_value();
for (size_t i = 0; i < filter.size(); i++) {
if (filter[i] == 1) {
_data[i] = val;
}
}
}
template <typename T>
Status FixedLengthColumnBase<T>::update_rows(const Column& src, const uint32_t* indexes) {
const T* src_data = reinterpret_cast<const T*>(src.raw_data());

View File

@ -129,6 +129,8 @@ public:
_data.resize(_data.size() + count, DefaultValueGenerator<ValueType>::next_value());
}
void fill_default(const Filter& filter) override;
Status update_rows(const Column& src, const uint32_t* indexes) override;
// The `_data` support one size(> 2^32), but some interface such as update_rows() will use uint32_t to

View File

@ -9,14 +9,9 @@
namespace starrocks::vectorized {
void JsonColumn::append_datum(const Datum& datum) {
if (const JsonValue* json = datum.get_if<JsonValue>()) {
append(json);
} else if (JsonValue* const* json_p = datum.get_if<JsonValue*>()) {
append(*json_p);
} else {
CHECK(false) << "invalid datum type";
}
append(datum.get<JsonValue*>());
}
int JsonColumn::compare_at(size_t left_idx, size_t right_idx, const starrocks::vectorized::Column& rhs,
int nan_direction_hint) const {
JsonValue* x = get_object(left_idx);

View File

@ -18,7 +18,7 @@ NullableColumn::NullableColumn(MutableColumnPtr&& data_column, MutableColumnPtr&
<< "nullable column's data must be single column";
ColumnPtr ptr = std::move(null_column);
_null_column = std::static_pointer_cast<NullColumn>(ptr);
_has_null = SIMD::count_nonzero(_null_column->get_data());
_has_null = SIMD::contain_nonzero(_null_column->get_data(), 0);
}
NullableColumn::NullableColumn(ColumnPtr data_column, NullColumnPtr null_column)
@ -58,7 +58,7 @@ void NullableColumn::append(const Column& src, size_t offset, size_t count) {
_null_column->append(*c._null_column, offset, count);
_data_column->append(*c._data_column, offset, count);
_has_null = _has_null || SIMD::count_nonzero(&(c._null_column->get_data()[offset]), count);
_has_null = _has_null || SIMD::contain_nonzero(c._null_column->get_data(), offset, count);
} else {
_null_column->resize(_null_column->size() + count);
_data_column->append(src, offset, count);
@ -78,7 +78,7 @@ void NullableColumn::append_selective(const Column& src, const uint32_t* indexes
_null_column->append_selective(*src_column._null_column, indexes, from, size);
_data_column->append_selective(*src_column._data_column, indexes, from, size);
_has_null = _has_null || SIMD::count_nonzero(&_null_column->get_data()[orig_size], size);
_has_null = _has_null || SIMD::contain_nonzero(_null_column->get_data(), orig_size, size);
} else {
_null_column->resize(orig_size + size);
_data_column->append_selective(src, indexes, from, size);
@ -98,7 +98,7 @@ void NullableColumn::append_value_multiple_times(const Column& src, uint32_t ind
_null_column->append_value_multiple_times(*src_column._null_column, index, size);
_data_column->append_value_multiple_times(*src_column._data_column, index, size);
_has_null = _has_null || SIMD::count_nonzero(&_null_column->get_data()[orig_size], size);
_has_null = _has_null || SIMD::contain_nonzero(_null_column->get_data(), orig_size, size);
} else {
_null_column->resize(orig_size + size);
_data_column->append_value_multiple_times(src, index, size);
@ -157,6 +157,17 @@ void NullableColumn::append_value_multiple_times(const void* value, size_t count
null_column_data().insert(null_column_data().end(), count, 0);
}
void NullableColumn::fill_null_with_default() {
if (null_count() == 0) {
return;
}
_data_column->fill_default(_null_column->get_data());
}
void NullableColumn::update_has_null() {
_has_null = SIMD::contain_nonzero(_null_column->get_data(), 0);
}
Status NullableColumn::update_rows(const Column& src, const uint32_t* indexes) {
DCHECK_EQ(_null_column->size(), _data_column->size());
size_t replace_num = src.size();
@ -164,6 +175,8 @@ Status NullableColumn::update_rows(const Column& src, const uint32_t* indexes) {
const auto& c = down_cast<const NullableColumn&>(src);
RETURN_IF_ERROR(_null_column->update_rows(*c._null_column, indexes));
RETURN_IF_ERROR(_data_column->update_rows(*c._data_column, indexes));
// update rows may convert between null and not null, so we need count every times
update_has_null();
} else {
auto new_null_column = NullColumn::create();
new_null_column->get_data().insert(new_null_column->get_data().end(), replace_num, 0);
@ -345,7 +358,7 @@ void NullableColumn::check_or_die() const {
CHECK_EQ(_null_column->size(), _data_column->size());
// when _has_null=true, the column may have no null value, so don't check.
if (!_has_null) {
CHECK_EQ(SIMD::count_nonzero(_null_column->get_data()), 0);
CHECK(!SIMD::contain_nonzero(_null_column->get_data(), 0));
}
_data_column->check_or_die();
_null_column->check_or_die();

View File

@ -51,11 +51,12 @@ public:
void set_has_null(bool has_null) { _has_null = _has_null | has_null; }
void update_has_null() {
const NullColumn::Container& v = _null_column->get_data();
const auto* p = v.data();
_has_null = (p != nullptr) && (nullptr != memchr(p, 1, v.size() * sizeof(v[0])));
}
// Update null element to default value
void fill_null_with_default();
void fill_default(const Filter& filter) override {}
void update_has_null();
bool is_nullable() const override { return true; }

View File

@ -127,6 +127,16 @@ void ObjectColumn<T>::append_default(size_t count) {
}
}
template <typename T>
void ObjectColumn<T>::fill_default(const Filter& filter) {
for (size_t i = 0; i < filter.size(); i++) {
if (filter[i] == 1) {
_pool[i] = {};
}
}
_cache_ok = false;
}
template <typename T>
Status ObjectColumn<T>::update_rows(const Column& src, const uint32_t* indexes) {
const auto& obj_col = down_cast<const ObjectColumn<T>&>(src);

View File

@ -103,6 +103,8 @@ public:
void append_default(size_t count) override;
void fill_default(const Filter& filter) override;
Status update_rows(const Column& src, const uint32_t* indexes) override;
uint32_t serialize(size_t idx, uint8_t* pos) override;

View File

@ -23,8 +23,7 @@
#include "configbase.h"
namespace starrocks {
namespace config {
namespace starrocks::config {
// The cluster id.
CONF_Int32(cluster_id, "-1");
// The port on which ImpalaInternalService is exported.
@ -587,8 +586,8 @@ CONF_Int32(late_materialization_ratio, "10");
// `1000` will enable late materialization always select metric type.
CONF_Int32(metric_late_materialization_ratio, "1000");
// Max batched bytes for each transmit request.
CONF_Int64(max_transmit_batched_bytes, "65536");
// Max batched bytes for each transmit request. (256KB)
CONF_Int64(max_transmit_batched_bytes, "262144");
CONF_Int16(bitmap_max_filter_items, "30");
@ -743,6 +742,18 @@ CONF_String(starmgr_addr, "");
CONF_Int32(starlet_port, "9070");
#endif
} // namespace config
CONF_mBool(dependency_librdkafka_debug_enable, "false");
} // namespace starrocks
// A comma-separated list of debug contexts to enable.
// Producer debug context: broker, topic, msg
// Consumer debug context: consumer, cgrp, topic, fetch
// Other debug context: generic, metadata, feature, queue, protocol, security, interceptor, plugin
// admin, eos, mock, assigner, conf
CONF_String(dependency_librdkafka_debug, "all");
// Enable compression in table sink.
// The BE supports compression would get error when communicate with BE dose not support compression.
// For compatible consideration, we disable it by default.
CONF_Bool(table_sink_compression_enable, "false");
} // namespace starrocks::config

View File

@ -10,4 +10,6 @@ constexpr const int DEFAULT_CHUNK_SIZE = 4096;
// Chunk size for some huge type(HLL, JSON)
constexpr inline int CHUNK_SIZE_FOR_HUGE_TYPE = 4096;
constexpr inline int NUM_LOCK_SHARD_LOG = 5;
} // namespace starrocks

View File

@ -33,8 +33,12 @@ public:
// how many rows read from storage
virtual int64_t raw_rows_read() const = 0;
// how mnay rows returned after filtering.
// how many rows returned after filtering.
virtual int64_t num_rows_read() const = 0;
// how many bytes read from external
virtual int64_t num_bytes_read() const = 0;
// CPU time of this data source
virtual int64_t cpu_time_spent() const = 0;
// following fields are set by framework
// 1. runtime profile: any metrics you want to record

View File

@ -78,6 +78,12 @@ int64_t ESDataSource::raw_rows_read() const {
int64_t ESDataSource::num_rows_read() const {
return _rows_return_number;
}
int64_t ESDataSource::num_bytes_read() const {
return _bytes_read;
}
int64_t ESDataSource::cpu_time_spent() const {
return _cpu_time_ns;
}
Status ESDataSource::_build_conjuncts() {
Status status = Status::OK();
@ -218,6 +224,8 @@ Status ESDataSource::get_next(RuntimeState* state, vectorized::ChunkPtr* chunk)
return Status::EndOfFile("");
}
}
SCOPED_RAW_TIMER(&_cpu_time_ns);
{
SCOPED_TIMER(_materialize_timer);
RETURN_IF_ERROR(_es_scroll_parser->fill_chunk(state, chunk, &_line_eof));
@ -228,6 +236,7 @@ Status ESDataSource::get_next(RuntimeState* state, vectorized::ChunkPtr* chunk)
int64_t before = ck->num_rows();
COUNTER_UPDATE(_rows_read_counter, before);
_rows_read_number += before;
_bytes_read += ck->bytes_usage();
ExecNode::eval_conjuncts(_conjunct_ctxs, ck);

View File

@ -50,6 +50,8 @@ public:
int64_t raw_rows_read() const override;
int64_t num_rows_read() const override;
int64_t num_bytes_read() const override;
int64_t cpu_time_spent() const override;
private:
const ESDataSourceProvider* _provider;
@ -74,6 +76,8 @@ private:
bool _batch_eof = false;
int64_t _rows_read_number = 0;
int64_t _rows_return_number = 0;
int64_t _bytes_read = 0;
int64_t _cpu_time_ns = 0;
ESScanReader* _es_reader = nullptr;
std::unique_ptr<vectorized::ScrollParser> _es_scroll_parser;
@ -82,7 +86,6 @@ private:
RuntimeProfile::Counter* _read_timer = nullptr;
RuntimeProfile::Counter* _materialize_timer = nullptr;
RuntimeProfile::Counter* _rows_read_counter = nullptr;
// =========================
Status _build_conjuncts();

View File

@ -143,6 +143,9 @@ void HiveDataSource::_init_tuples_and_slots(RuntimeState* state) {
if (hdfs_scan_node.__isset.hive_column_names) {
_hive_column_names = hdfs_scan_node.hive_column_names;
}
if (hdfs_scan_node.__isset.case_sensitive) {
_case_sensitive = hdfs_scan_node.case_sensitive;
}
}
void HiveDataSource::_decompose_conjunct_ctxs() {
@ -177,7 +180,6 @@ void HiveDataSource::_init_counter(RuntimeState* state) {
const auto& hdfs_scan_node = _provider->_hdfs_scan_node;
_profile.runtime_profile = _runtime_profile;
_profile.pool = _pool;
_profile.rows_read_counter = ADD_COUNTER(_runtime_profile, "RowsRead", TUnit::UNIT);
_profile.bytes_read_counter = ADD_COUNTER(_runtime_profile, "BytesRead", TUnit::BYTES);
@ -242,6 +244,7 @@ Status HiveDataSource::_init_scanner(RuntimeState* state) {
scanner_params.min_max_conjunct_ctxs = _min_max_conjunct_ctxs;
scanner_params.min_max_tuple_desc = _min_max_tuple_desc;
scanner_params.hive_column_names = &_hive_column_names;
scanner_params.case_sensitive = _case_sensitive;
scanner_params.profile = &_profile;
scanner_params.open_limit = nullptr;
@ -292,6 +295,14 @@ int64_t HiveDataSource::num_rows_read() const {
if (_scanner == nullptr) return 0;
return _scanner->num_rows_read();
}
int64_t HiveDataSource::num_bytes_read() const {
if (_scanner == nullptr) return 0;
return _scanner->num_bytes_read();
}
int64_t HiveDataSource::cpu_time_spent() const {
if (_scanner == nullptr) return 0;
return _scanner->cpu_time_spent();
}
} // namespace connector
} // namespace starrocks

View File

@ -43,6 +43,8 @@ public:
int64_t raw_rows_read() const override;
int64_t num_rows_read() const override;
int64_t num_bytes_read() const override;
int64_t cpu_time_spent() const override;
private:
const HiveDataSourceProvider* _provider;
@ -98,6 +100,7 @@ private:
std::vector<std::string> _hive_column_names;
const LakeTableDescriptor* _lake_table = nullptr;
bool _case_sensitive = false;
// ======================================
// The following are profile metrics
@ -105,4 +108,4 @@ private:
};
} // namespace connector
} // namespace starrocks
} // namespace starrocks

View File

@ -83,6 +83,7 @@ Status JDBCDataSource::get_next(RuntimeState* state, vectorized::ChunkPtr* chunk
return Status::EndOfFile("");
}
_rows_read += (*chunk)->num_rows();
_bytes_read += (*chunk)->bytes_usage();
return Status::OK();
}
@ -92,6 +93,13 @@ int64_t JDBCDataSource::raw_rows_read() const {
int64_t JDBCDataSource::num_rows_read() const {
return _rows_read;
}
int64_t JDBCDataSource::num_bytes_read() const {
return _bytes_read;
}
int64_t JDBCDataSource::cpu_time_spent() const {
// TODO: calculte the real cputime
return 0;
}
Status JDBCDataSource::_create_scanner(RuntimeState* state) {
const TJDBCScanNode& jdbc_scan_node = _provider->_jdbc_scan_node;

View File

@ -49,6 +49,8 @@ public:
int64_t raw_rows_read() const override;
int64_t num_rows_read() const override;
int64_t num_bytes_read() const override;
int64_t cpu_time_spent() const override;
private:
Status _create_scanner(RuntimeState* state);
@ -60,6 +62,7 @@ private:
RuntimeState* _runtime_state = nullptr;
vectorized::JDBCScanner* _scanner = nullptr;
int64_t _rows_read = 0;
int64_t _bytes_read = 0;
};
} // namespace connector

View File

@ -230,6 +230,7 @@ Status MySQLDataSource::get_next(RuntimeState* state, vectorized::ChunkPtr* chun
++row_num;
RETURN_IF_ERROR(fill_chunk(chunk, data, length));
++_rows_read;
_bytes_read += (*chunk)->bytes_usage();
}
}
@ -241,11 +242,21 @@ int64_t MySQLDataSource::num_rows_read() const {
return _rows_read;
}
int64_t MySQLDataSource::num_bytes_read() const {
return _bytes_read;
}
int64_t MySQLDataSource::cpu_time_spent() const {
return _cpu_time_spent_ns;
}
void MySQLDataSource::close(RuntimeState* state) {
SCOPED_TIMER(_runtime_profile->total_time_counter());
}
Status MySQLDataSource::fill_chunk(vectorized::ChunkPtr* chunk, char** data, size_t* length) {
SCOPED_RAW_TIMER(&_cpu_time_spent_ns);
int materialized_col_idx = -1;
for (size_t col_idx = 0; col_idx < _slot_num; ++col_idx) {
SlotDescriptor* slot_desc = _tuple_desc->slots()[col_idx];

View File

@ -48,6 +48,8 @@ public:
int64_t raw_rows_read() const override;
int64_t num_rows_read() const override;
int64_t num_bytes_read() const override;
int64_t cpu_time_spent() const override;
private:
const MySQLDataSourceProvider* _provider;
@ -75,6 +77,8 @@ private:
std::unique_ptr<MysqlScanner> _mysql_scanner;
int64_t _rows_read = 0;
int64_t _bytes_read = 0;
int64_t _cpu_time_spent_ns = 0;
Status fill_chunk(vectorized::ChunkPtr* chunk, char** data, size_t* length);

View File

@ -146,6 +146,7 @@ set(EXEC_FILES
pipeline/scan/olap_scan_context.cpp
pipeline/scan/connector_scan_operator.cpp
pipeline/scan/morsel.cpp
pipeline/scan/chunk_buffer_limiter.cpp
pipeline/select_operator.cpp
pipeline/crossjoin/cross_join_context.cpp
pipeline/crossjoin/cross_join_right_sink_operator.cpp
@ -190,6 +191,7 @@ set(EXEC_FILES
pipeline/set/intersect_build_sink_operator.cpp
pipeline/set/intersect_probe_sink_operator.cpp
pipeline/set/intersect_output_source_operator.cpp
pipeline/chunk_accumulate_operator.cpp
workgroup/work_group.cpp
workgroup/scan_executor.cpp
workgroup/scan_task_queue.cpp

View File

@ -22,6 +22,7 @@
#include "exec/exchange_node.h"
#include "column/chunk.h"
#include "exec/pipeline/chunk_accumulate_operator.h"
#include "exec/pipeline/exchange/exchange_merge_sort_source_operator.h"
#include "exec/pipeline/exchange/exchange_source_operator.h"
#include "exec/pipeline/limit_operator.h"
@ -234,15 +235,13 @@ void ExchangeNode::debug_string(int indentation_level, std::stringstream* out) c
pipeline::OpFactories ExchangeNode::decompose_to_pipeline(pipeline::PipelineBuilderContext* context) {
using namespace pipeline;
OpFactories operators;
if (!_is_merging) {
auto exchange_source_op = std::make_shared<ExchangeSourceOperatorFactory>(
context->next_operator_id(), id(), _texchange_node, _num_senders, _input_row_desc);
exchange_source_op->set_degree_of_parallelism(context->degree_of_parallelism());
operators.emplace_back(exchange_source_op);
if (limit() != -1) {
operators.emplace_back(std::make_shared<LimitOperatorFactory>(context->next_operator_id(), id(), limit()));
}
} else {
auto exchange_merge_sort_source_operator = std::make_shared<ExchangeMergeSortSourceOperatorFactory>(
context->next_operator_id(), id(), _num_senders, _input_row_desc, &_sort_exec_exprs, _is_asc_order,
@ -250,10 +249,16 @@ pipeline::OpFactories ExchangeNode::decompose_to_pipeline(pipeline::PipelineBuil
exchange_merge_sort_source_operator->set_degree_of_parallelism(1);
operators.emplace_back(std::move(exchange_merge_sort_source_operator));
}
// Create a shared RefCountedRuntimeFilterCollector
auto&& rc_rf_probe_collector = std::make_shared<RcRfProbeCollector>(1, std::move(this->runtime_filter_collector()));
// Initialize OperatorFactory's fields involving runtime filters.
this->init_runtime_filter_for_operator(operators.back().get(), context, rc_rf_probe_collector);
if (operators.back()->has_runtime_filters()) {
operators.emplace_back(std::make_shared<ChunkAccumulateOperatorFactory>(context->next_operator_id(), id()));
}
if (limit() != -1) {
operators.emplace_back(std::make_shared<LimitOperatorFactory>(context->next_operator_id(), id(), limit()));
}

View File

@ -313,7 +313,7 @@ Status ExecNode::close(RuntimeState* state) {
return Status::OK();
}
_is_closed = true;
RETURN_IF_ERROR(exec_debug_action(TExecNodePhase::CLOSE));
exec_debug_action(TExecNodePhase::CLOSE);
if (_rows_returned_counter != nullptr) {
COUNTER_SET(_rows_returned_counter, _num_rows_returned);

View File

@ -30,7 +30,7 @@
namespace starrocks {
MysqlScanner::MysqlScanner(const MysqlScannerParam& param)
: _my_param(param), _my_conn(nullptr), _my_result(nullptr), _is_open(false), _field_num(0) {}
: _my_param(param), _my_conn(nullptr), _my_result(nullptr), _opened(false), _field_num(0) {}
MysqlScanner::~MysqlScanner() {
if (_my_result) {
@ -50,7 +50,7 @@ MysqlScanner::~MysqlScanner() {
}
Status MysqlScanner::open() {
if (_is_open) {
if (_opened) {
LOG(INFO) << "this scanner already opened";
return Status::OK();
}
@ -77,13 +77,13 @@ Status MysqlScanner::open() {
return Status::InternalError("mysql set character set failed.");
}
_is_open = true;
_opened = true;
return Status::OK();
}
Status MysqlScanner::query(const std::string& query) {
if (!_is_open) {
if (!_opened) {
return Status::InternalError("Query before open.");
}
@ -118,7 +118,7 @@ Status MysqlScanner::query(const std::string& table, const std::vector<std::stri
const std::vector<std::string>& filters,
const std::unordered_map<std::string, std::vector<std::string>>& filters_in,
std::unordered_map<std::string, bool>& filters_null_in_set, int64_t limit) {
if (!_is_open) {
if (!_opened) {
return Status::InternalError("Query before open.");
}
@ -191,7 +191,7 @@ Status MysqlScanner::query(const std::string& table, const std::vector<std::stri
}
Status MysqlScanner::get_next_row(char*** buf, unsigned long** lengths, bool* eos) {
if (!_is_open) {
if (!_opened) {
return Status::InternalError("GetNextRow before open.");
}

View File

@ -72,7 +72,7 @@ private:
__StarRocksMysql* _my_conn;
__StarRocksMysqlRes* _my_result;
std::string _sql_str;
bool _is_open;
bool _opened;
int _field_num;
};

View File

@ -53,7 +53,7 @@ StatusOr<vectorized::ChunkPtr> AggregateBlockingSinkOperator::pull_chunk(Runtime
}
Status AggregateBlockingSinkOperator::push_chunk(RuntimeState* state, const vectorized::ChunkPtr& chunk) {
_aggregator->evaluate_exprs(chunk.get());
RETURN_IF_ERROR(_aggregator->evaluate_exprs(chunk.get()));
bool agg_group_by_with_limit =
(!_aggregator->is_none_group_by_exprs() && // has group by
@ -75,8 +75,7 @@ Status AggregateBlockingSinkOperator::push_chunk(RuntimeState* state, const vect
APPLY_FOR_AGG_VARIANT_ALL(HASH_MAP_METHOD)
#undef HASH_MAP_METHOD
_mem_tracker->set(_aggregator->hash_map_variant().memory_usage() +
_aggregator->mem_pool()->total_reserved_bytes());
_mem_tracker->set(_aggregator->hash_map_variant().reserved_memory_usage(_aggregator->mem_pool()));
TRY_CATCH_BAD_ALLOC(_aggregator->try_convert_to_two_level_map());
}
if (_aggregator->is_none_group_by_exprs()) {

View File

@ -47,7 +47,7 @@ StatusOr<vectorized::ChunkPtr> AggregateDistinctBlockingSinkOperator::pull_chunk
Status AggregateDistinctBlockingSinkOperator::push_chunk(RuntimeState* state, const vectorized::ChunkPtr& chunk) {
DCHECK_LE(chunk->num_rows(), state->chunk_size());
_aggregator->evaluate_exprs(chunk.get());
RETURN_IF_ERROR(_aggregator->evaluate_exprs(chunk.get()));
{
SCOPED_TIMER(_aggregator->agg_compute_timer());
@ -63,8 +63,7 @@ Status AggregateDistinctBlockingSinkOperator::push_chunk(RuntimeState* state, co
APPLY_FOR_AGG_VARIANT_ALL(HASH_SET_METHOD)
#undef HASH_SET_METHOD
_mem_tracker->set(_aggregator->hash_set_variant().memory_usage() +
_aggregator->mem_pool()->total_reserved_bytes());
_mem_tracker->set(_aggregator->hash_set_variant().reserved_memory_usage(_aggregator->mem_pool()));
TRY_CATCH_BAD_ALLOC(_aggregator->try_convert_to_two_level_set());
_aggregator->update_num_input_rows(chunk->num_rows());

View File

@ -38,7 +38,7 @@ Status AggregateDistinctStreamingSinkOperator::push_chunk(RuntimeState* state, c
_aggregator->update_num_input_rows(chunk_size);
COUNTER_SET(_aggregator->input_row_count(), _aggregator->num_input_rows());
_aggregator->evaluate_exprs(chunk.get());
RETURN_IF_ERROR(_aggregator->evaluate_exprs(chunk.get()));
if (_aggregator->streaming_preaggregation_mode() == TStreamingPreaggregationMode::FORCE_STREAMING) {
return _push_chunk_by_force_streaming();
@ -74,7 +74,7 @@ Status AggregateDistinctStreamingSinkOperator::_push_chunk_by_force_preaggregati
COUNTER_SET(_aggregator->hash_table_size(), (int64_t)_aggregator->hash_set_variant().size());
_mem_tracker->set(_aggregator->hash_set_variant().memory_usage() + _aggregator->mem_pool()->total_reserved_bytes());
_mem_tracker->set(_aggregator->hash_set_variant().reserved_memory_usage(_aggregator->mem_pool()));
TRY_CATCH_BAD_ALLOC(_aggregator->try_convert_to_two_level_set());
return Status::OK();
@ -85,9 +85,9 @@ Status AggregateDistinctStreamingSinkOperator::_push_chunk_by_auto(const size_t
size_t real_capacity = _aggregator->hash_set_variant().capacity() - _aggregator->hash_set_variant().capacity() / 8;
size_t remain_size = real_capacity - _aggregator->hash_set_variant().size();
bool ht_needs_expansion = remain_size < chunk_size;
size_t allocated_bytes = _aggregator->hash_set_variant().allocated_memory_usage(_aggregator->mem_pool());
if (!ht_needs_expansion ||
_aggregator->should_expand_preagg_hash_tables(_aggregator->num_input_rows(), chunk_size,
_aggregator->mem_pool()->total_allocated_bytes(),
_aggregator->should_expand_preagg_hash_tables(_aggregator->num_input_rows(), chunk_size, allocated_bytes,
_aggregator->hash_set_variant().size())) {
// hash table is not full or allow expand the hash table according reduction rate
SCOPED_TIMER(_aggregator->agg_compute_timer());
@ -106,8 +106,7 @@ Status AggregateDistinctStreamingSinkOperator::_push_chunk_by_auto(const size_t
COUNTER_SET(_aggregator->hash_table_size(), (int64_t)_aggregator->hash_set_variant().size());
_mem_tracker->set(_aggregator->hash_set_variant().memory_usage() +
_aggregator->mem_pool()->total_reserved_bytes());
_mem_tracker->set(_aggregator->hash_set_variant().reserved_memory_usage(_aggregator->mem_pool()));
TRY_CATCH_BAD_ALLOC(_aggregator->try_convert_to_two_level_set());
} else {
{

View File

@ -38,7 +38,7 @@ Status AggregateStreamingSinkOperator::push_chunk(RuntimeState* state, const vec
_aggregator->update_num_input_rows(chunk_size);
COUNTER_SET(_aggregator->input_row_count(), _aggregator->num_input_rows());
_aggregator->evaluate_exprs(chunk.get());
RETURN_IF_ERROR(_aggregator->evaluate_exprs(chunk.get()));
if (_aggregator->streaming_preaggregation_mode() == TStreamingPreaggregationMode::FORCE_STREAMING) {
return _push_chunk_by_force_streaming();
@ -78,7 +78,7 @@ Status AggregateStreamingSinkOperator::_push_chunk_by_force_preaggregation(const
_aggregator->compute_batch_agg_states(chunk_size);
}
_mem_tracker->set(_aggregator->hash_map_variant().memory_usage() + _aggregator->mem_pool()->total_reserved_bytes());
_mem_tracker->set(_aggregator->hash_map_variant().reserved_memory_usage(_aggregator->mem_pool()));
TRY_CATCH_BAD_ALLOC(_aggregator->try_convert_to_two_level_map());
COUNTER_SET(_aggregator->hash_table_size(), (int64_t)_aggregator->hash_map_variant().size());
@ -91,9 +91,9 @@ Status AggregateStreamingSinkOperator::_push_chunk_by_auto(const size_t chunk_si
size_t real_capacity = _aggregator->hash_map_variant().capacity() - _aggregator->hash_map_variant().capacity() / 8;
size_t remain_size = real_capacity - _aggregator->hash_map_variant().size();
bool ht_needs_expansion = remain_size < chunk_size;
size_t allocated_bytes = _aggregator->hash_map_variant().allocated_memory_usage(_aggregator->mem_pool());
if (!ht_needs_expansion ||
_aggregator->should_expand_preagg_hash_tables(_aggregator->num_input_rows(), chunk_size,
_aggregator->mem_pool()->total_allocated_bytes(),
_aggregator->should_expand_preagg_hash_tables(_aggregator->num_input_rows(), chunk_size, allocated_bytes,
_aggregator->hash_map_variant().size())) {
// hash table is not full, or expanding the hash table is allowed according to the reduction rate
SCOPED_TIMER(_aggregator->agg_compute_timer());
@ -116,8 +116,7 @@ Status AggregateStreamingSinkOperator::_push_chunk_by_auto(const size_t chunk_si
_aggregator->compute_batch_agg_states(chunk_size);
}
_mem_tracker->set(_aggregator->hash_map_variant().memory_usage() +
_aggregator->mem_pool()->total_reserved_bytes());
_mem_tracker->set(_aggregator->hash_map_variant().reserved_memory_usage(_aggregator->mem_pool()));
TRY_CATCH_BAD_ALLOC(_aggregator->try_convert_to_two_level_map());
COUNTER_SET(_aggregator->hash_table_size(), (int64_t)_aggregator->hash_map_variant().size());

View File

@ -27,6 +27,7 @@ Status AssertNumRowsOperator::prepare(RuntimeState* state) {
}
void AssertNumRowsOperator::close(RuntimeState* state) {
_cur_chunk.reset();
Operator::close(state);
}

View File

@ -0,0 +1,55 @@
// This file is licensed under the Elastic License 2.0. Copyright 2021 StarRocks Limited.
#include "exec/pipeline/chunk_accumulate_operator.h"
#include "column/chunk.h"
#include "runtime/runtime_state.h"
namespace starrocks {
namespace pipeline {
Status ChunkAccumulateOperator::push_chunk(RuntimeState* state, const vectorized::ChunkPtr& chunk) {
DCHECK(_out_chunk == nullptr);
if (_in_chunk == nullptr) {
_in_chunk = chunk;
} else if (_in_chunk->num_rows() + chunk->num_rows() > state->chunk_size()) {
_out_chunk = std::move(_in_chunk);
_in_chunk = chunk;
} else {
_in_chunk->append(*chunk);
}
if (_out_chunk == nullptr && (_in_chunk->num_rows() >= state->chunk_size() * LOW_WATERMARK_ROWS_RATE ||
_in_chunk->memory_usage() >= LOW_WATERMARK_BYTES)) {
_out_chunk = std::move(_in_chunk);
}
return Status::OK();
}
StatusOr<vectorized::ChunkPtr> ChunkAccumulateOperator::pull_chunk(RuntimeState*) {
// If there is no more input chunk and _out_chunk has already been output, output _in_chunk this time.
if (_is_finished && _out_chunk == nullptr) {
return std::move(_in_chunk);
}
return std::move(_out_chunk);
}
Status ChunkAccumulateOperator::set_finishing(RuntimeState* state) {
_is_finished = true;
return Status::OK();
}
Status ChunkAccumulateOperator::set_finished(RuntimeState*) {
_is_finished = true;
_in_chunk.reset();
_out_chunk.reset();
return Status::OK();
}
} // namespace pipeline
} // namespace starrocks

View File

@ -0,0 +1,53 @@
// This file is licensed under the Elastic License 2.0. Copyright 2021-present, StarRocks Limited.
#pragma once
#include "exec/pipeline/operator.h"
namespace starrocks {
class RuntimeState;
namespace pipeline {
// Accumulate input chunks and output a merged chunk once the accumulated number of rows is large enough.
class ChunkAccumulateOperator final : public Operator {
public:
ChunkAccumulateOperator(OperatorFactory* factory, int32_t id, int32_t plan_node_id, int32_t driver_sequence)
: Operator(factory, id, "chunk_accumulate", plan_node_id, driver_sequence) {}
~ChunkAccumulateOperator() override = default;
Status push_chunk(RuntimeState* state, const vectorized::ChunkPtr& chunk) override;
StatusOr<vectorized::ChunkPtr> pull_chunk(RuntimeState* state) override;
bool has_output() const override { return _out_chunk != nullptr || (_is_finished && _in_chunk != nullptr); }
bool need_input() const override { return !_is_finished && _out_chunk == nullptr; }
bool is_finished() const override { return _is_finished && _in_chunk == nullptr && _out_chunk == nullptr; }
Status set_finishing(RuntimeState* state) override;
Status set_finished(RuntimeState* state) override;
private:
static constexpr double LOW_WATERMARK_ROWS_RATE = 0.75; // 0.75 * chunk_size
static constexpr size_t LOW_WATERMARK_BYTES = 256 * 1024 * 1024; // 256MB.
bool _is_finished = false;
vectorized::ChunkPtr _in_chunk = nullptr;
vectorized::ChunkPtr _out_chunk = nullptr;
};
class ChunkAccumulateOperatorFactory final : public OperatorFactory {
public:
ChunkAccumulateOperatorFactory(int32_t id, int32_t plan_node_id)
: OperatorFactory(id, "chunk_accumulate", plan_node_id) {}
~ChunkAccumulateOperatorFactory() override = default;
OperatorPtr create(int32_t degree_of_parallelism, int32_t driver_sequence) override {
return std::make_shared<ChunkAccumulateOperator>(this, _id, _plan_node_id, driver_sequence);
}
};
} // namespace pipeline
} // namespace starrocks
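The operator defined above emits the accumulated chunk once either watermark is crossed. A minimal standalone sketch of that policy, with FakeChunk as a hypothetical stand-in for vectorized::Chunk:

#include <cstddef>

// Hypothetical stand-in for vectorized::Chunk, for illustration only.
struct FakeChunk {
    size_t rows = 0;
    size_t bytes = 0;
};

// Emit once rows reach 0.75 * chunk_size or memory reaches 256MB, mirroring
// LOW_WATERMARK_ROWS_RATE and LOW_WATERMARK_BYTES above.
bool reached_low_watermark(const FakeChunk& in, size_t chunk_size) {
    constexpr double kRowsRate = 0.75;
    constexpr size_t kBytes = 256UL * 1024 * 1024;
    return in.rows >= static_cast<size_t>(chunk_size * kRowsRate) || in.bytes >= kBytes;
}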

View File

@ -12,6 +12,11 @@
#include "runtime/runtime_state.h"
namespace starrocks::pipeline {
void CrossJoinContext::close(RuntimeState* state) {
_build_chunks.clear();
}
Status CrossJoinContext::_init_runtime_filter(RuntimeState* state) {
vectorized::ChunkPtr one_row_chunk = nullptr;
size_t num_rows = 0;

View File

@ -35,12 +35,9 @@ public:
_rf_hub(params.rf_hub),
_rf_descs(std::move(params.rf_descs)) {}
void close(RuntimeState* state) override {}
void close(RuntimeState* state) override;
bool is_build_chunk_empty() const {
return std::all_of(_build_chunks.begin(), _build_chunks.end(),
[](const vectorized::ChunkPtr& chunk) { return chunk == nullptr || chunk->is_empty(); });
}
bool is_build_chunk_empty() const { return _is_build_chunk_empty; }
int32_t num_build_chunks() const { return _num_right_sinkers; }
@ -53,6 +50,9 @@ public:
Status finish_one_right_sinker(RuntimeState* state) {
if (_num_right_sinkers - 1 == _num_finished_right_sinkers.fetch_add(1)) {
RETURN_IF_ERROR(_init_runtime_filter(state));
_is_build_chunk_empty = std::all_of(
_build_chunks.begin(), _build_chunks.end(),
[](const vectorized::ChunkPtr& chunk) { return chunk == nullptr || chunk->is_empty(); });
_all_right_finished.store(true, std::memory_order_release);
}
return Status::OK();
@ -75,6 +75,7 @@ private:
// _build_chunks[i] contains all the rows from the i-th CrossJoinRightSinkOperator.
std::vector<vectorized::ChunkPtr> _build_chunks;
bool _is_build_chunk_empty = false;
// finished flags
std::atomic_bool _all_right_finished = false;
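finish_one_right_sinker above follows a "last finisher does the work" pattern: every sinker bumps the atomic counter, and only the one observing the final count computes the shared state before publishing the release-ordered flag. A condensed sketch of the same idiom, under hypothetical names:

#include <atomic>

class BuildSideBarrier {
public:
    explicit BuildSideBarrier(int num_sinkers) : _num_sinkers(num_sinkers) {}

    // Called exactly once by each right sinker when it finishes.
    void finish_one() {
        if (_num_sinkers - 1 == _num_finished.fetch_add(1)) {
            // Only the last finisher computes derived state single-threaded...
            _is_empty = true; // placeholder: the real code scans the build chunks with std::all_of
            // ...then publishes it; readers pair this store with memory_order_acquire.
            _all_finished.store(true, std::memory_order_release);
        }
    }

    bool all_finished() const { return _all_finished.load(std::memory_order_acquire); }

private:
    const int _num_sinkers;
    std::atomic<int> _num_finished{0};
    bool _is_empty = false;
    std::atomic<bool> _all_finished{false};
};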

View File

@ -13,6 +13,7 @@ Status DictDecodeOperator::prepare(RuntimeState* state) {
}
void DictDecodeOperator::close(RuntimeState* state) {
_cur_chunk.reset();
Operator::close(state);
}

View File

@ -9,6 +9,7 @@
#include <memory>
#include <random>
#include "common/config.h"
#include "exec/pipeline/exchange/sink_buffer.h"
#include "exprs/expr.h"
#include "gen_cpp/Types_types.h"
@ -59,7 +60,8 @@ public:
// Channel will send the input request directly without batching it.
// This function is only used for broadcast, because the request can be reused
// by all the channels.
Status send_chunk_request(PTransmitChunkParamsPtr chunk_request, const butil::IOBuf& attachment);
Status send_chunk_request(PTransmitChunkParamsPtr chunk_request, const butil::IOBuf& attachment,
int64_t attachment_physical_bytes);
// Used when doing shuffle.
// This function will copy selective rows in chunks to batch.
@ -213,12 +215,13 @@ Status ExchangeSinkOperator::Channel::send_one_chunk(const vectorized::Chunk* ch
// Try to accumulate enough bytes before sending an RPC. When eos is true we should send
// the last packet
if (_current_request_bytes > _parent->_request_bytes_threshold || eos) {
if (_current_request_bytes > config::max_transmit_batched_bytes || eos) {
_chunk_request->set_eos(eos);
_chunk_request->set_use_pass_through(_use_pass_through);
butil::IOBuf attachment;
_parent->construct_brpc_attachment(_chunk_request, attachment);
TransmitChunkInfo info = {this->_fragment_instance_id, _brpc_stub, std::move(_chunk_request), attachment};
int64_t attachment_physical_bytes = _parent->construct_brpc_attachment(_chunk_request, attachment);
TransmitChunkInfo info = {this->_fragment_instance_id, _brpc_stub, std::move(_chunk_request), attachment,
attachment_physical_bytes};
_parent->_buffer->add_request(info);
_current_request_bytes = 0;
_chunk_request.reset();
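The send path above accumulates serialized chunks into one request until the byte threshold (or eos) is hit, then flushes it as a single RPC. A minimal sketch of that batching policy, with Batcher, request, and SendFn as hypothetical stand-ins:

#include <cstddef>
#include <string>

struct Batcher {
    size_t threshold;            // stands in for config::max_transmit_batched_bytes
    size_t current_bytes = 0;
    std::string request;         // stands in for the accumulated PTransmitChunkParams

    template <typename SendFn>
    void add(const std::string& serialized_chunk, bool eos, SendFn&& send) {
        request += serialized_chunk;
        current_bytes += serialized_chunk.size();
        if (current_bytes > threshold || eos) {
            send(request);       // flush the accumulated chunks as one RPC
            request.clear();
            current_bytes = 0;
        }
    }
};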
@ -229,14 +232,16 @@ Status ExchangeSinkOperator::Channel::send_one_chunk(const vectorized::Chunk* ch
}
Status ExchangeSinkOperator::Channel::send_chunk_request(PTransmitChunkParamsPtr chunk_request,
const butil::IOBuf& attachment) {
const butil::IOBuf& attachment,
int64_t attachment_physical_bytes) {
chunk_request->set_node_id(_dest_node_id);
chunk_request->set_sender_id(_parent->_sender_id);
chunk_request->set_be_number(_parent->_be_number);
chunk_request->set_eos(false);
chunk_request->set_use_pass_through(_use_pass_through);
TransmitChunkInfo info = {this->_fragment_instance_id, _brpc_stub, std::move(chunk_request), attachment};
TransmitChunkInfo info = {this->_fragment_instance_id, _brpc_stub, std::move(chunk_request), attachment,
attachment_physical_bytes};
_parent->_buffer->add_request(info);
return Status::OK();
@ -367,11 +372,11 @@ bool ExchangeSinkOperator::is_finished() const {
}
bool ExchangeSinkOperator::need_input() const {
return !is_finished() && !_buffer->is_full();
return !is_finished() && _buffer != nullptr && !_buffer->is_full();
}
bool ExchangeSinkOperator::pending_finish() const {
return !_buffer->is_finished();
return _buffer != nullptr && !_buffer->is_finished();
}
Status ExchangeSinkOperator::set_cancelled(RuntimeState* state) {
@ -425,13 +430,14 @@ Status ExchangeSinkOperator::push_chunk(RuntimeState* state, const vectorized::C
RETURN_IF_ERROR(serialize_chunk(send_chunk, pchunk, &_is_first_chunk, _channels.size())));
_current_request_bytes += pchunk->data().size();
// 3. if the request bytes exceed the threshold, send the current request
if (_current_request_bytes > _request_bytes_threshold) {
if (_current_request_bytes > config::max_transmit_batched_bytes) {
butil::IOBuf attachment;
construct_brpc_attachment(_chunk_request, attachment);
int64_t attachment_physical_bytes = construct_brpc_attachment(_chunk_request, attachment);
for (auto idx : _channel_indices) {
if (!_channels[idx]->use_pass_through()) {
PTransmitChunkParamsPtr copy = std::make_shared<PTransmitChunkParams>(*_chunk_request);
RETURN_IF_ERROR(_channels[idx]->send_chunk_request(copy, attachment));
RETURN_IF_ERROR(
_channels[idx]->send_chunk_request(copy, attachment, attachment_physical_bytes));
}
}
_current_request_bytes = 0;
@ -525,10 +531,10 @@ Status ExchangeSinkOperator::set_finishing(RuntimeState* state) {
if (_chunk_request != nullptr) {
butil::IOBuf attachment;
construct_brpc_attachment(_chunk_request, attachment);
int64_t attachment_physical_bytes = construct_brpc_attachment(_chunk_request, attachment);
for (const auto& channel : _channels) {
PTransmitChunkParamsPtr copy = std::make_shared<PTransmitChunkParams>(*_chunk_request);
channel->send_chunk_request(copy, attachment);
channel->send_chunk_request(copy, attachment, attachment_physical_bytes);
}
_current_request_bytes = 0;
_chunk_request.reset();
@ -602,17 +608,25 @@ Status ExchangeSinkOperator::serialize_chunk(const vectorized::Chunk* src, Chunk
return Status::OK();
}
void ExchangeSinkOperator::construct_brpc_attachment(PTransmitChunkParamsPtr chunk_request, butil::IOBuf& attachment) {
int64_t ExchangeSinkOperator::construct_brpc_attachment(PTransmitChunkParamsPtr chunk_request,
butil::IOBuf& attachment) {
int64_t attachment_physical_bytes = 0;
for (int i = 0; i < chunk_request->chunks().size(); ++i) {
auto chunk = chunk_request->mutable_chunks(i);
chunk->set_data_size(chunk->data().size());
int64_t before_bytes = CurrentThread::current().get_consumed_bytes();
attachment.append(chunk->data());
attachment_physical_bytes += CurrentThread::current().get_consumed_bytes() - before_bytes;
chunk->clear_data();
// If the request is too big, free the memory in order to avoid OOM
if (_is_large_chunk(chunk->data_size())) {
chunk->mutable_data()->shrink_to_fit();
}
}
return attachment_physical_bytes;
}
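construct_brpc_attachment now returns the physical bytes of the attachment by sampling the thread-local allocation counter before and after each append (CurrentThread::current().get_consumed_bytes() in the code above). A sketch of that measurement idiom, with g_consumed_bytes as a hypothetical counter that a real allocator hook would advance:

#include <cstdint>
#include <string>

// Hypothetical: a real implementation advances this from the allocator hooks.
thread_local int64_t g_consumed_bytes = 0;

int64_t append_and_measure(std::string& attachment, const std::string& data) {
    int64_t before = g_consumed_bytes;
    attachment.append(data);          // may allocate; the hook advances the counter
    return g_consumed_bytes - before; // physical bytes attributed to this append
}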
ExchangeSinkOperatorFactory::ExchangeSinkOperatorFactory(
@ -653,6 +667,7 @@ Status ExchangeSinkOperatorFactory::prepare(RuntimeState* state) {
}
void ExchangeSinkOperatorFactory::close(RuntimeState* state) {
_buffer.reset();
Expr::close(_partition_expr_ctxs, state);
OperatorFactory::close(state);
}

View File

@ -63,7 +63,8 @@ public:
// For other chunk, only serialize the chunk data to ChunkPB.
Status serialize_chunk(const vectorized::Chunk* chunk, ChunkPB* dst, bool* is_first_chunk, int num_receivers = 1);
void construct_brpc_attachment(PTransmitChunkParamsPtr _chunk_request, butil::IOBuf& attachment);
// Returns the physical bytes of the attachment.
int64_t construct_brpc_attachment(PTransmitChunkParamsPtr _chunk_request, butil::IOBuf& attachment);
private:
bool _is_large_chunk(size_t sz) const {
@ -107,7 +108,6 @@ private:
// Only used when broadcast
PTransmitChunkParamsPtr _chunk_request;
size_t _current_request_bytes = 0;
size_t _request_bytes_threshold = config::max_transmit_batched_bytes;
bool _is_first_chunk = true;

View File

@ -86,7 +86,7 @@ Status PartitionExchanger::accept(const vectorized::ChunkPtr& chunk, const int32
// and used later in pull_chunk() of the source operator. If we reused partition_row_indexes in the partitioner,
// it would be overwritten by the next call to partitioner.partition_chunk().
std::shared_ptr<std::vector<uint32_t>> partition_row_indexes = std::make_shared<std::vector<uint32_t>>(num_rows);
partitioner.partition_chunk(chunk, *partition_row_indexes);
RETURN_IF_ERROR(partitioner.partition_chunk(chunk, *partition_row_indexes));
for (size_t i = 0; i < _source->get_sources().size(); ++i) {
size_t from = partitioner.partition_begin_offset(i);

View File

@ -4,6 +4,11 @@
#include <chrono>
DIAGNOSTIC_PUSH
DIAGNOSTIC_IGNORE("-Wclass-memaccess")
#include <bthread/bthread.h>
DIAGNOSTIC_POP
#include "fmt/core.h"
#include "util/time.h"
#include "util/uid_util.h"
@ -86,14 +91,19 @@ bool SinkBuffer::is_full() const {
for (auto& [_, buffer] : _buffers) {
buffer_size += buffer.size();
}
bool is_full = buffer_size > max_buffer_size;
const bool is_full = buffer_size > max_buffer_size;
if (is_full && _last_full_timestamp == -1) {
_last_full_timestamp = MonotonicNanos();
int64_t last_full_timestamp = _last_full_timestamp;
int64_t full_time = _full_time;
if (is_full && last_full_timestamp == -1) {
_last_full_timestamp.compare_exchange_weak(last_full_timestamp, MonotonicNanos());
}
if (!is_full && _last_full_timestamp != -1) {
_full_time += (MonotonicNanos() - _last_full_timestamp);
_last_full_timestamp = -1;
if (!is_full && last_full_timestamp != -1) {
// The following two update operations cannot guarantee atomicity as a whole without a lock,
// but we can accept bias in the estimation
_full_time.compare_exchange_weak(full_time, full_time + (MonotonicNanos() - last_full_timestamp));
_last_full_timestamp.compare_exchange_weak(last_full_timestamp, -1);
}
return is_full;
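The timestamps above were turned into atomics so that concurrent is_full() callers cannot tear reads, and the CAS pairs deliberately tolerate the occasional lost update, as the comment notes. A condensed sketch of the same idiom:

#include <atomic>
#include <chrono>
#include <cstdint>

static int64_t monotonic_nanos() {
    using namespace std::chrono;
    return duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count();
}

struct FullTimeTracker {
    mutable std::atomic<int64_t> last_full_ts{-1};
    mutable std::atomic<int64_t> full_time{0};

    void observe(bool is_full) const {
        int64_t ts = last_full_ts.load();
        int64_t total = full_time.load();
        if (is_full && ts == -1) {
            // Start timing; a losing CAS just means another thread already started it.
            last_full_ts.compare_exchange_weak(ts, monotonic_nanos());
        } else if (!is_full && ts != -1) {
            // The two updates are not atomic as a whole; a small bias is acceptable.
            full_time.compare_exchange_weak(total, total + (monotonic_nanos() - ts));
            last_full_ts.compare_exchange_weak(ts, -1);
        }
    }
};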
@ -218,12 +228,17 @@ void SinkBuffer::_try_to_send_rpc(const TUniqueId& instance_id, std::function<vo
return;
}
TransmitChunkInfo request = buffer.front();
TransmitChunkInfo& request = buffer.front();
bool need_wait = false;
DeferOp pop_defer([&need_wait, &buffer]() {
DeferOp pop_defer([&need_wait, &buffer, mem_tracker = _mem_tracker]() {
if (need_wait) {
return;
}
// The request memory is acquired by ExchangeSinkOperator,
// so use the instance_mem_tracker passed from ExchangeSinkOperator to release memory.
// This must be invoked before decrease_defer is destructed, to avoid sink_buffer and fragment_ctx being released.
SCOPED_THREAD_LOCAL_MEM_TRACKER_SETTER(mem_tracker);
buffer.pop();
});
@ -311,10 +326,23 @@ void SinkBuffer::_try_to_send_rpc(const TUniqueId& instance_id, std::function<vo
++_total_in_flight_rpc;
++_num_in_flight_rpcs[instance_id.lo];
// The attachment will be released by process_mem_tracker in closure->Run() in a bthread when the response is received,
// so decrease the memory usage of the attachment from instance_mem_tracker immediately before sending the request.
_mem_tracker->release(request.attachment_physical_bytes);
ExecEnv::GetInstance()->process_mem_tracker()->consume(request.attachment_physical_bytes);
closure->cntl.Reset();
closure->cntl.set_timeout_ms(_brpc_timeout_ms);
closure->cntl.request_attachment().append(request.attachment);
request.brpc_stub->transmit_chunk(&closure->cntl, request.params.get(), &closure->result, closure);
if (bthread_self()) {
request.brpc_stub->transmit_chunk(&closure->cntl, request.params.get(), &closure->result, closure);
} else {
// When the driver worker thread sends the request and creates the protobuf request,
// also use process_mem_tracker to record the memory of the protobuf request.
SCOPED_THREAD_LOCAL_MEM_TRACKER_SETTER(nullptr);
request.brpc_stub->transmit_chunk(&closure->cntl, request.params.get(), &closure->result, closure);
}
return;
}

View File

@ -32,6 +32,7 @@ struct TransmitChunkInfo {
doris::PBackendService_Stub* brpc_stub;
PTransmitChunkParamsPtr params;
butil::IOBuf attachment;
int64_t attachment_physical_bytes;
};
struct ClosureContext {
@ -104,7 +105,7 @@ private:
int64_t _network_time();
FragmentContext* _fragment_ctx;
const MemTracker* _mem_tracker;
MemTracker* const _mem_tracker;
const int32_t _brpc_timeout_ms;
const bool _is_dest_merge;
@ -157,8 +158,8 @@ private:
std::atomic<int64_t> _request_sent = 0;
int64_t _pending_timestamp = -1;
mutable int64_t _last_full_timestamp = -1;
mutable int64_t _full_time = 0;
mutable std::atomic<int64_t> _last_full_timestamp = -1;
mutable std::atomic<int64_t> _full_time = 0;
};
} // namespace starrocks::pipeline

View File

@ -80,7 +80,9 @@ void FragmentContext::prepare_pass_through_chunk_buffer() {
_runtime_state->exec_env()->stream_mgr()->prepare_pass_through_chunk_buffer(_query_id);
}
void FragmentContext::destroy_pass_through_chunk_buffer() {
_runtime_state->exec_env()->stream_mgr()->destroy_pass_through_chunk_buffer(_query_id);
if (_runtime_state) {
_runtime_state->exec_env()->stream_mgr()->destroy_pass_through_chunk_buffer(_query_id);
}
}
} // namespace starrocks::pipeline

View File

@ -60,7 +60,6 @@ Status FragmentExecutor::_prepare_query_ctx(ExecEnv* exec_env, const TExecPlanFr
const auto& params = request.params;
const auto& query_id = params.query_id;
const auto& fragment_instance_id = params.fragment_instance_id;
const auto& query_options = request.query_options;
auto&& existing_query_ctx = exec_env->query_context_mgr()->get(query_id);
if (existing_query_ctx) {
@ -75,14 +74,12 @@ Status FragmentExecutor::_prepare_query_ctx(ExecEnv* exec_env, const TExecPlanFr
if (params.__isset.instances_number) {
_query_ctx->set_total_fragments(params.instances_number);
}
if (query_options.__isset.query_timeout) {
_query_ctx->set_expire_seconds(std::max<int>(query_options.query_timeout, 1));
} else {
_query_ctx->set_expire_seconds(300);
}
_query_ctx->set_delivery_expire_seconds(_calc_delivery_expired_seconds(request));
_query_ctx->set_query_expire_seconds(_calc_query_expired_seconds(request));
// initialize query's deadline
_query_ctx->extend_lifetime();
_query_ctx->extend_delivery_lifetime();
_query_ctx->extend_query_lifetime();
return Status::OK();
}
@ -204,6 +201,33 @@ int32_t FragmentExecutor::_calc_dop(ExecEnv* exec_env, const TExecPlanFragmentPa
return exec_env->calc_pipeline_dop(degree_of_parallelism);
}
int FragmentExecutor::_calc_delivery_expired_seconds(const TExecPlanFragmentParams& request) const {
const auto& query_options = request.query_options;
int expired_seconds = QueryContext::DEFAULT_EXPIRE_SECONDS;
if (query_options.__isset.query_delivery_timeout) {
if (query_options.__isset.query_timeout) {
expired_seconds = std::min(query_options.query_timeout, query_options.query_delivery_timeout);
} else {
expired_seconds = query_options.query_delivery_timeout;
}
} else if (query_options.__isset.query_timeout) {
expired_seconds = query_options.query_timeout;
}
return std::max<int>(1, expired_seconds);
}
int FragmentExecutor::_calc_query_expired_seconds(const TExecPlanFragmentParams& request) const {
const auto& query_options = request.query_options;
if (query_options.__isset.query_timeout) {
return std::max<int>(1, query_options.query_timeout);
}
return QueryContext::DEFAULT_EXPIRE_SECONDS;
}
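A compact restatement of the two derivations above (hypothetical free function; values in seconds): the delivery timeout is capped by the query timeout when both are set, and both fall back to QueryContext::DEFAULT_EXPIRE_SECONDS with a floor of one second.

#include <algorithm>
#include <optional>

int calc_delivery_expire_seconds(std::optional<int> query_timeout, std::optional<int> delivery_timeout,
                                 int default_seconds = 300) {
    int seconds = default_seconds;
    if (delivery_timeout) {
        seconds = query_timeout ? std::min(*query_timeout, *delivery_timeout) : *delivery_timeout;
    } else if (query_timeout) {
        seconds = *query_timeout;
    }
    return std::max(1, seconds); // guard against non-positive timeouts
}
// e.g. query_timeout=300, query_delivery_timeout=60 -> delivery expires after 60s,
// while the query deadline itself stays at 300s.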
Status FragmentExecutor::_prepare_exec_plan(ExecEnv* exec_env, const TExecPlanFragmentParams& request) {
auto* runtime_state = _fragment_ctx->runtime_state();
auto* obj_pool = runtime_state->obj_pool();
@ -232,6 +256,7 @@ Status FragmentExecutor::_prepare_exec_plan(ExecEnv* exec_env, const TExecPlanFr
std::vector<TScanRangeParams> no_scan_ranges;
plan->collect_scan_nodes(&scan_nodes);
int64_t sum_scan_limit = 0;
MorselQueueMap& morsel_queues = _fragment_ctx->morsel_queues();
for (auto& i : scan_nodes) {
ScanNode* scan_node = down_cast<ScanNode*>(i);
@ -240,6 +265,22 @@ Status FragmentExecutor::_prepare_exec_plan(ExecEnv* exec_env, const TExecPlanFr
ASSIGN_OR_RETURN(MorselQueuePtr morsel_queue,
scan_node->convert_scan_range_to_morsel_queue(scan_ranges, scan_node->id(), request));
morsel_queues.emplace(scan_node->id(), std::move(morsel_queue));
if (scan_node->limit() > 0) {
sum_scan_limit += scan_node->limit();
}
}
int dop = exec_env->calc_pipeline_dop(request.pipeline_dop);
if (_wg && _wg->big_query_scan_rows_limit() > 0) {
// For SQL like: select * from xxx limit 5, the underlying scan_limit should be 5 * parallelism.
// Otherwise this SQL would exceed big_query_scan_rows_limit due to the underlying IO parallelization
if (sum_scan_limit <= _wg->big_query_scan_rows_limit()) {
int parallelism = dop * ScanOperator::MAX_IO_TASKS_PER_OP;
int64_t parallel_scan_limit = sum_scan_limit * parallelism;
_query_ctx->set_scan_limit(parallel_scan_limit);
} else {
_query_ctx->set_scan_limit(_wg->big_query_scan_rows_limit());
}
}
return Status::OK();
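As a worked example of the scan-limit adjustment above: for select * from t limit 5, assuming (hypothetically) dop = 8 and ScanOperator::MAX_IO_TASKS_PER_OP = 4, the parallelism is 32, and since the SQL-level limit of 5 does not exceed big_query_scan_rows_limit, the query-level scan limit becomes 5 * 32 = 160 rows rather than 5; only when the summed SQL-level limit already exceeds the workgroup limit is big_query_scan_rows_limit used directly.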
@ -291,7 +332,8 @@ Status FragmentExecutor::_prepare_pipeline_driver(ExecEnv* exec_env, const TExec
auto source_id = pipeline->get_op_factories()[0]->plan_node_id();
DCHECK(morsel_queues.count(source_id));
auto& morsel_queue = morsel_queues[source_id];
DCHECK(morsel_queue->num_morsels() == 0 || cur_pipeline_dop <= morsel_queue->num_morsels());
DCHECK(morsel_queue->max_degree_of_parallelism() == 0 ||
cur_pipeline_dop <= morsel_queue->max_degree_of_parallelism());
for (size_t i = 0; i < cur_pipeline_dop; ++i) {
auto&& operators = pipeline->create_operators(cur_pipeline_dop, i);
@ -411,6 +453,8 @@ void FragmentExecutor::_fail_cleanup() {
if (_query_ctx) {
if (_fragment_ctx) {
_query_ctx->fragment_mgr()->unregister(_fragment_ctx->fragment_instance_id());
_fragment_ctx->destroy_pass_through_chunk_buffer();
_fragment_ctx.reset();
}
if (_query_ctx->count_down_fragments()) {
auto query_id = _query_ctx->query_id();

View File

@ -27,6 +27,8 @@ public:
private:
void _fail_cleanup();
int32_t _calc_dop(ExecEnv* exec_env, const TExecPlanFragmentParams& request) const;
int _calc_delivery_expired_seconds(const TExecPlanFragmentParams& request) const;
int _calc_query_expired_seconds(const TExecPlanFragmentParams& request) const;
// Several steps to prepare a fragment
// 1. query context

View File

@ -13,6 +13,7 @@
namespace starrocks::pipeline {
/// Operator.
const int32_t Operator::s_pseudo_plan_node_id_for_result_sink = -99;
const int32_t Operator::s_pseudo_plan_node_id_upper_bound = -100;
@ -114,6 +115,9 @@ Status Operator::eval_conjuncts_and_in_filters(const std::vector<ExprContext*>&
in_filters.end());
_conjuncts_and_in_filters_is_cached = true;
}
if (_cached_conjuncts_and_in_filters.empty()) {
return Status::OK();
}
if (chunk == nullptr || chunk->is_empty()) {
return Status::OK();
}
@ -166,7 +170,7 @@ void Operator::_init_rf_counters(bool init_bloom) {
void Operator::_init_conjuct_counters() {
if (_conjuncts_timer == nullptr) {
_conjuncts_timer = ADD_TIMER(_common_metrics, "JoinRuntimeFilterTime");
_conjuncts_timer = ADD_TIMER(_common_metrics, "ConjunctsTime");
_conjuncts_input_counter = ADD_COUNTER(_common_metrics, "ConjunctsInputRows", TUnit::UNIT);
_conjuncts_output_counter = ADD_COUNTER(_common_metrics, "ConjunctsOutputRows", TUnit::UNIT);
_conjuncts_eval_counter = ADD_COUNTER(_common_metrics, "ConjunctsEvaluate", TUnit::UNIT);
@ -201,6 +205,7 @@ Status OperatorFactory::prepare(RuntimeState* state) {
return Status::OK();
}
/// OperatorFactory.
void OperatorFactory::close(RuntimeState* state) {
if (_runtime_filter_collector) {
_runtime_filter_collector->close(state);
@ -224,4 +229,18 @@ void OperatorFactory::_prepare_runtime_in_filters(RuntimeState* state) {
}
}
bool OperatorFactory::has_runtime_filters() const {
// Check runtime in-filters.
if (!_rf_waiting_set.empty()) {
return true;
}
// Check runtime bloom-filters.
if (_runtime_filter_collector == nullptr) {
return false;
}
auto* global_rf_collector = _runtime_filter_collector->get_rf_probe_collector();
return global_rf_collector != nullptr && !global_rf_collector->descriptors().empty();
}
} // namespace starrocks::pipeline

View File

@ -268,6 +268,10 @@ public:
RowDescriptor* row_desc() { return &_row_desc; }
// Whether it has any runtime in-filter or bloom-filter.
// MUST be invoked after init_runtime_filter.
bool has_runtime_filters() const;
protected:
void _prepare_runtime_in_filters(RuntimeState* state);

View File

@ -160,6 +160,22 @@ MorselQueue* PipelineBuilderContext::morsel_queue_of_source_operator(const Sourc
return morsel_queues[source_id].get();
}
size_t PipelineBuilderContext::degree_of_parallelism_of_source_operator(int32_t source_node_id) const {
auto& morsel_queues = _fragment_context->morsel_queues();
auto it = morsel_queues.find(source_node_id);
if (it == morsel_queues.end()) {
return _degree_of_parallelism;
}
// The degree of parallelism of a SourceOperator with morsels is not more than the number of morsels.
// If the table is empty, the morsel count is zero and we still set the degree of parallelism to 1
return std::min<size_t>(std::max<size_t>(1, it->second->max_degree_of_parallelism()), _degree_of_parallelism);
}
size_t PipelineBuilderContext::degree_of_parallelism_of_source_operator(const SourceOperatorFactory* source_op) const {
return degree_of_parallelism_of_source_operator(source_op->plan_node_id());
}
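A one-line sketch of the clamp above, plus its two edge cases:

#include <algorithm>
#include <cstddef>

// Never more drivers than the fragment DOP, never fewer than one.
size_t source_dop(size_t max_dop_of_morsel_queue, size_t fragment_dop) {
    return std::min(std::max<size_t>(1, max_dop_of_morsel_queue), fragment_dop);
}
// source_dop(3, 16) == 3; source_dop(0, 16) == 1 (an empty table still gets one driver).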
Pipelines PipelineBuilder::build(const FragmentContext& fragment, ExecNode* exec_node) {
pipeline::OpFactories operators = exec_node->decompose_to_pipeline(&_context);
_context.add_pipeline(operators);

View File

@ -59,6 +59,8 @@ public:
FragmentContext* fragment_context() { return _fragment_context; }
MorselQueue* morsel_queue_of_source_operator(const SourceOperatorFactory* source_op);
size_t degree_of_parallelism_of_source_operator(int32_t source_node_id) const;
size_t degree_of_parallelism_of_source_operator(const SourceOperatorFactory* source_op) const;
private:
static constexpr int kLocalExchangeBufferChunks = 8;

View File

@ -43,6 +43,7 @@ Status PipelineDriver::prepare(RuntimeState* runtime_state) {
_first_input_empty_timer = ADD_CHILD_TIMER(_runtime_profile, "FirstInputEmptyTime", "InputEmptyTime");
_followup_input_empty_timer = ADD_CHILD_TIMER(_runtime_profile, "FollowupInputEmptyTime", "InputEmptyTime");
_output_full_timer = ADD_CHILD_TIMER(_runtime_profile, "OutputFullTime", "PendingTime");
_pending_finish_timer = ADD_CHILD_TIMER(_runtime_profile, "PendingFinishTime", "PendingTime");
DCHECK(_state == DriverState::NOT_READY);
@ -92,11 +93,13 @@ Status PipelineDriver::prepare(RuntimeState* runtime_state) {
_precondition_block_timer_sw = runtime_state->obj_pool()->add(new MonotonicStopWatch());
_input_empty_timer_sw = runtime_state->obj_pool()->add(new MonotonicStopWatch());
_output_full_timer_sw = runtime_state->obj_pool()->add(new MonotonicStopWatch());
_pending_finish_timer_sw = runtime_state->obj_pool()->add(new MonotonicStopWatch());
_total_timer_sw->start();
_pending_timer_sw->start();
_precondition_block_timer_sw->start();
_input_empty_timer_sw->start();
_output_full_timer_sw->start();
_pending_finish_timer_sw->start();
return Status::OK();
}

View File

@ -190,10 +190,12 @@ public:
case DriverState::OUTPUT_FULL:
_output_full_timer->update(_output_full_timer_sw->elapsed_time());
break;
case DriverState::PRECONDITION_BLOCK: {
case DriverState::PRECONDITION_BLOCK:
_precondition_block_timer->update(_precondition_block_timer_sw->elapsed_time());
break;
}
case DriverState::PENDING_FINISH:
_pending_finish_timer->update(_pending_finish_timer_sw->elapsed_time());
break;
default:
break;
}
@ -208,6 +210,9 @@ public:
case DriverState::PRECONDITION_BLOCK:
_precondition_block_timer_sw->reset();
break;
case DriverState::PENDING_FINISH:
_pending_finish_timer_sw->reset();
break;
default:
break;
}
@ -420,12 +425,14 @@ private:
RuntimeProfile::Counter* _first_input_empty_timer = nullptr;
RuntimeProfile::Counter* _followup_input_empty_timer = nullptr;
RuntimeProfile::Counter* _output_full_timer = nullptr;
RuntimeProfile::Counter* _pending_finish_timer = nullptr;
MonotonicStopWatch* _total_timer_sw = nullptr;
MonotonicStopWatch* _pending_timer_sw = nullptr;
MonotonicStopWatch* _precondition_block_timer_sw = nullptr;
MonotonicStopWatch* _input_empty_timer_sw = nullptr;
MonotonicStopWatch* _output_full_timer_sw = nullptr;
MonotonicStopWatch* _pending_finish_timer_sw = nullptr;
};
} // namespace pipeline

View File

@ -5,6 +5,7 @@
#include <memory>
#include "exec/workgroup/work_group.h"
#include "gen_cpp/Types_types.h"
#include "gutil/strings/substitute.h"
#include "runtime/current_thread.h"
#include "util/defer_op.h"
@ -78,6 +79,10 @@ void GlobalDriverExecutor::_worker_thread() {
if (_num_threads_setter.should_shrink()) {
break;
}
// Reset TLS state
CurrentThread::current().set_query_id({});
CurrentThread::current().set_fragment_instance_id({});
CurrentThread::current().set_pipeline_driver_id(0);
auto maybe_driver = this->_driver_queue->take(worker_id);
if (maybe_driver.status().is_cancelled()) {
@ -88,9 +93,9 @@ void GlobalDriverExecutor::_worker_thread() {
auto* query_ctx = driver->query_ctx();
auto* fragment_ctx = driver->fragment_ctx();
tls_thread_status.set_query_id(query_ctx->query_id());
tls_thread_status.set_fragment_instance_id(fragment_ctx->fragment_instance_id());
tls_thread_status.set_pipeline_driver_id(driver->driver_id());
CurrentThread::current().set_query_id(query_ctx->query_id());
CurrentThread::current().set_fragment_instance_id(fragment_ctx->fragment_instance_id());
CurrentThread::current().set_pipeline_driver_id(driver->driver_id());
// TODO(trueeyu): This is written to ensure that the MemTracker will not be destructed before the thread ends.
// This approach is a bit tricky; replace it when there is a better way
@ -302,10 +307,11 @@ void GlobalDriverExecutor::_simplify_common_metrics(RuntimeProfile* driver_profi
DCHECK(common_metrics != nullptr);
// Remove runtime-filter-related counters if their value is 0
static std::string counter_names[] = {
"RuntimeInFilterNum", "RuntimeBloomFilterNum", "JoinRuntimeFilterInputRows",
"JoinRuntimeFilterOutputRows", "JoinRuntimeFilterEvaluate", "JoinRuntimeFilterTime",
"ConjunctsInputRows", "ConjunctsOutputRows", "ConjunctsEvaluate"};
static std::string counter_names[] = {"RuntimeInFilterNum", "RuntimeBloomFilterNum",
"JoinRuntimeFilterInputRows", "JoinRuntimeFilterOutputRows",
"JoinRuntimeFilterEvaluate", "JoinRuntimeFilterTime",
"ConjunctsInputRows", "ConjunctsOutputRows",
"ConjunctsEvaluate", "ConjunctsTime"};
for (auto& name : counter_names) {
auto* counter = common_metrics->get_counter(name);
if (counter != nullptr && counter->value() == 0) {

View File

@ -53,7 +53,7 @@ void PipelineDriverPoller::run_internal() {
while (driver_it != local_blocked_drivers.end()) {
auto* driver = *driver_it;
if (driver->query_ctx()->is_expired()) {
if (driver->query_ctx()->is_query_expired()) {
// if no driver belonging to a query context can make progress for an expiration period, it
// indicates that some fragments are missing because of a failed exec_plan_fragment invocation. in
// this situation, the query eventually fails, so the drivers are marked PENDING_FINISH/FINISH.
@ -63,7 +63,7 @@ void PipelineDriverPoller::run_internal() {
LOG(WARNING) << "[Driver] Timeout, query_id=" << print_id(driver->query_ctx()->query_id())
<< ", instance_id=" << print_id(driver->fragment_ctx()->fragment_instance_id());
driver->fragment_ctx()->cancel(Status::TimedOut(fmt::format(
"Query exceeded time limit of {} seconds", driver->query_ctx()->get_expire_seconds())));
"Query exceeded time limit of {} seconds", driver->query_ctx()->get_query_expire_seconds())));
driver->cancel_operators(driver->fragment_ctx()->runtime_state());
if (driver->is_still_pending_finish()) {
driver->set_driver_state(DriverState::PENDING_FINISH);
@ -137,10 +137,10 @@ void PipelineDriverPoller::run_internal() {
}
void PipelineDriverPoller::add_blocked_driver(const DriverRawPtr driver) {
std::unique_lock<std::mutex> lock(this->_mutex);
this->_blocked_drivers.push_back(driver);
std::unique_lock<std::mutex> lock(_mutex);
_blocked_drivers.push_back(driver);
driver->_pending_timer_sw->reset();
this->_cond.notify_one();
_cond.notify_one();
}
void PipelineDriverPoller::remove_blocked_driver(DriverList& local_blocked_drivers, DriverList::iterator& driver_it) {

View File

@ -15,6 +15,7 @@ Status ProjectOperator::prepare(RuntimeState* state) {
}
void ProjectOperator::close(RuntimeState* state) {
_cur_chunk.reset();
Operator::close(state);
}

View File

@ -15,8 +15,7 @@ QueryContext::QueryContext()
: _fragment_mgr(new FragmentContextManager()),
_total_fragments(0),
_num_fragments(0),
_num_active_fragments(0),
_deadline(0) {}
_num_active_fragments(0) {}
QueryContext::~QueryContext() {
// When destructing FragmentContextManager, we use the query-level MemTracker. since when PipelineDriver executor
@ -122,7 +121,7 @@ void QueryContextManager::_clean_slot_unlocked(size_t i) {
auto& sc_map = _second_chance_maps[i];
auto sc_it = sc_map.begin();
while (sc_it != sc_map.end()) {
if (sc_it->second->has_no_active_instances() && sc_it->second->is_expired()) {
if (sc_it->second->has_no_active_instances() && sc_it->second->is_delivery_expired()) {
sc_it = sc_map.erase(sc_it);
} else {
++sc_it;
@ -260,7 +259,7 @@ bool QueryContextManager::remove(const TUniqueId& query_id) {
// in the future, so extend the lifetime of query context and wait for some time till fragments on wire have
// vanished
auto ctx = std::move(it->second);
ctx->extend_lifetime();
ctx->extend_delivery_lifetime();
context_map.erase(it);
sc_map.emplace(query_id, std::move(ctx));
return false;

View File

@ -45,18 +45,28 @@ public:
int num_active_fragments() const { return _num_active_fragments.load(); }
bool has_no_active_instances() { return _num_active_fragments.load() == 0; }
void set_expire_seconds(int expire_seconds) { _expire_seconds = seconds(expire_seconds); }
inline int get_expire_seconds() { return _expire_seconds.count(); }
void set_delivery_expire_seconds(int expire_seconds) { _delivery_expire_seconds = seconds(expire_seconds); }
void set_query_expire_seconds(int expire_seconds) { _query_expire_seconds = seconds(expire_seconds); }
inline int get_query_expire_seconds() const { return _query_expire_seconds.count(); }
// Whether the current time point has passed the deadline.
bool is_expired() {
bool is_delivery_expired() const {
auto now = duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
return now > _deadline;
return now > _delivery_deadline;
}
bool is_query_expired() const {
auto now = duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
return now > _query_deadline;
}
bool is_dead() { return _num_active_fragments == 0 && _num_fragments == _total_fragments; }
// extend the deadline by the expiration seconds
void extend_lifetime() {
_deadline = duration_cast<milliseconds>(steady_clock::now().time_since_epoch() + _expire_seconds).count();
void extend_delivery_lifetime() {
_delivery_deadline =
duration_cast<milliseconds>(steady_clock::now().time_since_epoch() + _delivery_expire_seconds).count();
}
void extend_query_lifetime() {
_query_deadline =
duration_cast<milliseconds>(steady_clock::now().time_since_epoch() + _query_expire_seconds).count();
}
FragmentContextManager* fragment_mgr();
@ -103,6 +113,12 @@ public:
int64_t query_begin_time() const { return _query_begin_time; }
void init_query_begin_time() { _query_begin_time = MonotonicNanos(); }
void set_scan_limit(int64_t scan_limit) { _scan_limit = scan_limit; }
int64_t get_scan_limit() const { return _scan_limit; }
public:
static constexpr int DEFAULT_EXPIRE_SECONDS = 300;
private:
ExecEnv* _exec_env = nullptr;
TUniqueId _query_id;
@ -110,8 +126,10 @@ private:
size_t _total_fragments;
std::atomic<size_t> _num_fragments;
std::atomic<size_t> _num_active_fragments;
int64_t _deadline;
seconds _expire_seconds;
int64_t _delivery_deadline = 0;
int64_t _query_deadline = 0;
seconds _delivery_expire_seconds = seconds(DEFAULT_EXPIRE_SECONDS);
seconds _query_expire_seconds = seconds(DEFAULT_EXPIRE_SECONDS);
bool _is_runtime_filter_coordinator = false;
std::once_flag _init_mem_tracker_once;
std::shared_ptr<RuntimeProfile> _profile;
@ -125,6 +143,7 @@ private:
std::atomic<int64_t> _cur_scan_rows_num = 0;
std::atomic<int64_t> _cur_scan_bytes = 0;
int64_t _scan_limit = 0;
int64_t _init_wg_cpu_cost = 0;
};
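The net effect of the refactoring above is two independent clocks: the delivery deadline bounds how long the QueryContext waits for all fragment instances to arrive and is what the second-chance cleanup checks via is_delivery_expired(), while the query deadline bounds total execution time and is what the driver poller checks via is_query_expired(); both default to DEFAULT_EXPIRE_SECONDS (300 seconds) when the corresponding query option is unset.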

View File

@ -5,6 +5,7 @@
#include "column/chunk.h"
#include "exprs/expr.h"
#include "runtime/buffer_control_block.h"
#include "runtime/current_thread.h"
#include "runtime/exec_env.h"
#include "runtime/mysql_result_writer.h"
#include "runtime/query_statistics.h"
@ -72,13 +73,23 @@ StatusOr<vectorized::ChunkPtr> ResultSinkOperator::pull_chunk(RuntimeState* stat
CHECK(false) << "Shouldn't pull chunk from result sink operator";
}
Status ResultSinkOperator::set_cancelled(RuntimeState* state) {
SCOPED_THREAD_LOCAL_MEM_TRACKER_SETTER(nullptr);
_fetch_data_result.clear();
return Status::OK();
}
bool ResultSinkOperator::need_input() const {
SCOPED_THREAD_LOCAL_MEM_TRACKER_SETTER(nullptr);
if (is_finished()) {
return false;
}
if (_fetch_data_result.empty()) {
return true;
}
auto* mysql_writer = down_cast<MysqlResultWriter*>(_writer.get());
auto status = mysql_writer->try_add_batch(_fetch_data_result);
if (status.ok()) {
@ -90,10 +101,20 @@ bool ResultSinkOperator::need_input() const {
}
Status ResultSinkOperator::push_chunk(RuntimeState* state, const vectorized::ChunkPtr& chunk) {
// The ResultWriter memory that sends the results is no longer recorded against the query memory.
// There are two reasons:
// 1. The query result has already been produced; if the memory limit is triggered afterwards,
//    canceling the query is unnecessary.
// 2. If this memory were counted, the memory of the receiving thread would also need to be recorded,
//    and the life cycle of the MemTracker would need to be considered.
//
// All the places that acquire and release memory of _fetch_data_result must use process_mem_tracker.
SCOPED_THREAD_LOCAL_MEM_TRACKER_SETTER(nullptr);
if (!_last_error.ok()) {
return _last_error;
}
DCHECK(_fetch_data_result.empty());
auto* mysql_writer = down_cast<MysqlResultWriter*>(_writer.get());
auto status = mysql_writer->process_chunk_for_pipeline(chunk.get());
if (status.ok()) {
@ -120,4 +141,5 @@ void ResultSinkOperatorFactory::close(RuntimeState* state) {
Expr::close(_output_expr_ctxs, state);
OperatorFactory::close(state);
}
} // namespace starrocks::pipeline
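The scattered SCOPED_THREAD_LOCAL_MEM_TRACKER_SETTER(nullptr) calls above all rely on the same RAII idiom: swap the thread-local tracker for the duration of the scope. A condensed sketch under hypothetical names (nullptr standing for the process-level tracker):

class MemTracker; // opaque for this sketch

class ScopedTrackerSetter {
public:
    explicit ScopedTrackerSetter(MemTracker* t) : _prev(tls_tracker()) { tls_tracker() = t; }
    ~ScopedTrackerSetter() { tls_tracker() = _prev; } // restore on scope exit

private:
    static MemTracker*& tls_tracker() {
        static thread_local MemTracker* tracker = nullptr;
        return tracker;
    }
    MemTracker* _prev;
};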

View File

@ -48,6 +48,8 @@ public:
return Status::OK();
}
Status set_cancelled(RuntimeState* state) override;
StatusOr<vectorized::ChunkPtr> pull_chunk(RuntimeState* state) override;
Status push_chunk(RuntimeState* state, const vectorized::ChunkPtr& chunk) override;

View File

@ -0,0 +1,42 @@
// This file is licensed under the Elastic License 2.0. Copyright 2021 StarRocks Limited.
#include "exec/pipeline/scan/chunk_buffer_limiter.h"
#include "glog/logging.h"
namespace starrocks::pipeline {
void DynamicChunkBufferLimiter::update_avg_row_bytes(size_t added_sum_row_bytes, size_t added_num_rows,
size_t max_chunk_rows) {
std::lock_guard<std::mutex> lock(_mutex);
_sum_row_bytes += added_sum_row_bytes;
_num_rows += added_num_rows;
size_t avg_row_bytes = 0;
if (_num_rows > 0) {
avg_row_bytes = _sum_row_bytes / _num_rows;
}
if (avg_row_bytes == 0) {
return;
}
size_t chunk_mem_usage = avg_row_bytes * max_chunk_rows;
size_t new_capacity = std::max<size_t>(_mem_limit / chunk_mem_usage, 1);
_capacity = std::min(new_capacity, _max_capacity);
}
ChunkBufferTokenPtr DynamicChunkBufferLimiter::pin(int num_chunks) {
size_t prev_value = _pinned_chunks_counter.fetch_add(num_chunks);
if (prev_value + num_chunks > _capacity) {
_unpin(num_chunks);
return nullptr;
}
return std::make_unique<DynamicChunkBufferLimiter::Token>(_pinned_chunks_counter, num_chunks);
}
void DynamicChunkBufferLimiter::_unpin(int num_chunks) {
int prev_value = _pinned_chunks_counter.fetch_sub(num_chunks);
DCHECK_GE(prev_value, 1);
}
} // namespace starrocks::pipeline
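Worked example for update_avg_row_bytes above: with _mem_limit = 1 GiB, an observed average row of 64 bytes, and max_chunk_rows = 4096, one chunk costs roughly 64 * 4096 = 256 KiB, so new_capacity = 1 GiB / 256 KiB = 4096 and the capacity becomes min(4096, _max_capacity); the std::max(..., 1) keeps at least one chunk pinnable even for very wide rows. Note that pin() increments the counter optimistically and rolls back on overshoot, so a failed pin returns nullptr instead of blocking the caller.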

View File

@ -0,0 +1,124 @@
// This file is licensed under the Elastic License 2.0. Copyright 2021-present, StarRocks Limited.
#pragma once
#include <atomic>
#include <memory>
#include <mutex>
namespace starrocks::pipeline {
class ChunkBufferToken;
using ChunkBufferTokenPtr = std::unique_ptr<ChunkBufferToken>;
class ChunkBufferLimiter;
using ChunkBufferLimiterPtr = std::unique_ptr<ChunkBufferLimiter>;
class ChunkBufferToken {
public:
virtual ~ChunkBufferToken() = default;
};
// Limits the capacity of a chunk buffer.
// - Before creating a new chunk, use `pin()` to pin a position in the buffer and obtain a token.
// - After a chunk is popped from the buffer, destruct the token to unpin the position.
// All the methods are thread-safe.
class ChunkBufferLimiter {
public:
virtual ~ChunkBufferLimiter() = default;
// Update the chunk memory usage statistics.
// `added_sum_row_bytes` is the total bytes of the newly read rows.
// `added_num_rows` is the number of newly read rows.
virtual void update_avg_row_bytes(size_t added_sum_row_bytes, size_t added_num_rows, size_t max_chunk_rows) {}
// Pin a position in the buffer and return a token.
// When the token is destructed, the position is unpinned.
virtual ChunkBufferTokenPtr pin(int num_chunks) = 0;
// Returns true when it cannot pin a position for now.
virtual bool is_full() const = 0;
// The number of already pinned positions.
virtual size_t size() const = 0;
// The max number of positions able to be pinned.
virtual size_t capacity() const = 0;
// The default capacity when there are no chunk memory usage statistics.
virtual size_t default_capacity() const = 0;
};
// The capacity of this limiter is unlimited.
class UnlimitedChunkBufferLimiter final : public ChunkBufferLimiter {
public:
class Token final : public ChunkBufferToken {
public:
~Token() override = default;
};
public:
~UnlimitedChunkBufferLimiter() override = default;
ChunkBufferTokenPtr pin(int num_chunks) override { return std::make_unique<Token>(); }
bool is_full() const override { return false; }
size_t size() const override { return 0; }
size_t capacity() const override { return 0; }
size_t default_capacity() const override { return 0; }
};
// Use the dynamic chunk memory usage statistics to compute the capacity.
class DynamicChunkBufferLimiter final : public ChunkBufferLimiter {
public:
class Token final : public ChunkBufferToken {
public:
Token(std::atomic<int>& acquired_tokens_counter, int num_tokens)
: _acquired_tokens_counter(acquired_tokens_counter), _num_tokens(num_tokens) {}
~Token() override { _acquired_tokens_counter.fetch_sub(_num_tokens); }
// Disable copy/move ctor and assignment.
Token(const Token&) = delete;
Token& operator=(const Token&) = delete;
Token(Token&&) = delete;
Token& operator=(Token&&) = delete;
private:
std::atomic<int>& _acquired_tokens_counter;
const int _num_tokens;
};
public:
DynamicChunkBufferLimiter(size_t max_capacity, size_t default_capacity, int64_t mem_limit, int chunk_size)
: _capacity(default_capacity),
_max_capacity(max_capacity),
_default_capacity(default_capacity),
_mem_limit(mem_limit),
_chunk_size(chunk_size) {}
~DynamicChunkBufferLimiter() override = default;
void update_avg_row_bytes(size_t added_sum_row_bytes, size_t added_num_rows, size_t max_chunk_rows) override;
ChunkBufferTokenPtr pin(int num_chunks) override;
bool is_full() const override { return _pinned_chunks_counter >= _capacity; }
size_t size() const override { return _pinned_chunks_counter; }
size_t capacity() const override { return _capacity; }
size_t default_capacity() const override { return _default_capacity; }
private:
void _unpin(int num_chunks);
private:
std::mutex _mutex;
size_t _sum_row_bytes = 0;
size_t _num_rows = 0;
size_t _capacity;
const size_t _max_capacity;
const size_t _default_capacity;
const int64_t _mem_limit;
const int _chunk_size;
std::atomic<int> _pinned_chunks_counter = 0;
};
} // namespace starrocks::pipeline
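A minimal usage sketch of the pin/token contract defined above; ChunkQueue and Chunk are hypothetical stand-ins for the operator's buffered queue and chunk type:

#include <memory>
#include <utility>

template <typename ChunkQueue, typename Chunk>
bool produce_one_chunk(starrocks::pipeline::ChunkBufferLimiter& limiter, ChunkQueue& queue, Chunk chunk) {
    starrocks::pipeline::ChunkBufferTokenPtr token = limiter.pin(1);
    if (token == nullptr) {
        return false; // over capacity: the caller yields instead of blocking
    }
    // Enqueue the chunk together with its token; when the consumer pops the pair
    // and the token is destructed, the pinned position is released automatically.
    queue.put(std::make_pair(std::move(chunk), std::move(token)));
    return true;
}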

View File

@ -6,11 +6,13 @@
#include "column/vectorized_fwd.h"
#include "common/statusor.h"
#include "exec/pipeline/scan/chunk_buffer_limiter.h"
#include "exec/pipeline/scan/morsel.h"
#include "exec/workgroup/work_group_fwd.h"
#include "util/exclusive_ptr.h"
namespace starrocks {
class RuntimeState;
class RuntimeProfile;
@ -19,12 +21,14 @@ namespace pipeline {
class ChunkSource {
public:
ChunkSource(RuntimeProfile* runtime_profile, MorselPtr&& morsel)
: _runtime_profile(runtime_profile), _morsel(std::move(morsel)){};
: _runtime_profile(runtime_profile), _morsel(std::move(morsel)) {}
virtual ~ChunkSource() = default;
virtual Status prepare(RuntimeState* state) = 0;
// Mark that it need not produce any more chunks.
virtual Status set_finished(RuntimeState* state) = 0;
virtual void close(RuntimeState* state) = 0;
// Return true if eos is not reached
@ -43,27 +47,25 @@ public:
size_t* num_read_chunks, int worker_id,
workgroup::WorkGroupPtr running_wg) = 0;
// Some statistic of chunk source
virtual int64_t last_spent_cpu_time_ns() { return 0; }
// Counters of scan
int64_t get_cpu_time_spent() { return _cpu_time_spent_ns; }
int64_t get_scan_rows() const { return _scan_rows_num; }
int64_t get_scan_bytes() const { return _scan_bytes; }
virtual int64_t last_scan_rows_num() {
int64_t res = _last_scan_rows_num;
_last_scan_rows_num = 0;
return res;
}
virtual int64_t last_scan_bytes() {
int64_t res = _last_scan_bytes;
_last_scan_bytes = 0;
return res;
}
void pin_chunk_token(ChunkBufferTokenPtr chunk_token) { _chunk_token = std::move(chunk_token); }
void unpin_chunk_token() { _chunk_token.reset(nullptr); }
protected:
RuntimeProfile* _runtime_profile;
// The morsel will be owned by the pipeline driver
MorselPtr _morsel;
int64_t _last_scan_rows_num = 0;
int64_t _last_scan_bytes = 0;
// NOTE: These counters need to be maintained by ChunkSource implementations, and updated in real time
int64_t _cpu_time_spent_ns = 0;
int64_t _scan_rows_num = 0;
int64_t _scan_bytes = 0;
ChunkBufferTokenPtr _chunk_token = nullptr;
};
using ChunkSourcePtr = std::shared_ptr<ChunkSource>;
@ -71,5 +73,6 @@ using ChunkSourcePromise = std::promise<ChunkSourcePtr>;
using ChunkSourceFromisePtr = starrocks::exclusive_ptr<ChunkSourcePromise>;
using ChunkSourceFuture = std::future<ChunkSourcePtr>;
using OptionalChunkSourceFuture = std::optional<ChunkSourceFuture>;
} // namespace pipeline
} // namespace starrocks

View File

@ -3,6 +3,7 @@
#include "exec/pipeline/scan/connector_scan_operator.h"
#include "column/chunk.h"
#include "exec/pipeline/scan/chunk_buffer_limiter.h"
#include "exec/vectorized/connector_scan_node.h"
#include "exec/workgroup/work_group.h"
#include "runtime/exec_env.h"
@ -12,8 +13,9 @@ namespace starrocks::pipeline {
// ==================== ConnectorScanOperatorFactory ====================
ConnectorScanOperatorFactory::ConnectorScanOperatorFactory(int32_t id, ScanNode* scan_node)
: ScanOperatorFactory(id, scan_node) {}
ConnectorScanOperatorFactory::ConnectorScanOperatorFactory(int32_t id, ScanNode* scan_node,
ChunkBufferLimiterPtr buffer_limiter)
: ScanOperatorFactory(id, scan_node, std::move(buffer_limiter)) {}
Status ConnectorScanOperatorFactory::do_prepare(RuntimeState* state) {
const auto& conjunct_ctxs = _scan_node->conjunct_ctxs();
@ -29,16 +31,14 @@ void ConnectorScanOperatorFactory::do_close(RuntimeState* state) {
}
OperatorPtr ConnectorScanOperatorFactory::do_create(int32_t dop, int32_t driver_sequence) {
return std::make_shared<ConnectorScanOperator>(this, _id, driver_sequence, _scan_node, _max_scan_concurrency,
_num_committed_scan_tasks);
return std::make_shared<ConnectorScanOperator>(this, _id, driver_sequence, _scan_node, _buffer_limiter.get());
}
// ==================== ConnectorScanOperator ====================
ConnectorScanOperator::ConnectorScanOperator(OperatorFactory* factory, int32_t id, int32_t driver_sequence,
ScanNode* scan_node, int max_scan_concurrency,
std::atomic<int>& num_committed_scan_tasks)
: ScanOperator(factory, id, driver_sequence, scan_node, max_scan_concurrency, num_committed_scan_tasks) {}
ScanNode* scan_node, ChunkBufferLimiter* buffer_limiter)
: ScanOperator(factory, id, driver_sequence, scan_node, buffer_limiter) {}
Status ConnectorScanOperator::do_prepare(RuntimeState* state) {
return Status::OK();
@ -47,19 +47,21 @@ Status ConnectorScanOperator::do_prepare(RuntimeState* state) {
void ConnectorScanOperator::do_close(RuntimeState* state) {}
ChunkSourcePtr ConnectorScanOperator::create_chunk_source(MorselPtr morsel, int32_t chunk_source_index) {
vectorized::ConnectorScanNode* scan_node = down_cast<vectorized::ConnectorScanNode*>(_scan_node);
auto* scan_node = down_cast<vectorized::ConnectorScanNode*>(_scan_node);
return std::make_shared<ConnectorChunkSource>(_chunk_source_profiles[chunk_source_index].get(), std::move(morsel),
this, scan_node);
this, scan_node, _buffer_limiter);
}
// ==================== ConnectorChunkSource ====================
ConnectorChunkSource::ConnectorChunkSource(RuntimeProfile* runtime_profile, MorselPtr&& morsel, ScanOperator* op,
vectorized::ConnectorScanNode* scan_node)
vectorized::ConnectorScanNode* scan_node,
ChunkBufferLimiter* const buffer_limiter)
: ChunkSource(runtime_profile, std::move(morsel)),
_scan_node(scan_node),
_limit(scan_node->limit()),
_runtime_in_filters(op->runtime_in_filters()),
_runtime_bloom_filters(op->runtime_bloom_filters()) {
_runtime_bloom_filters(op->runtime_bloom_filters()),
_buffer_limiter(buffer_limiter) {
_conjunct_ctxs = scan_node->conjunct_ctxs();
_conjunct_ctxs.insert(_conjunct_ctxs.end(), _runtime_in_filters.begin(), _runtime_in_filters.end());
ScanMorsel* scan_morsel = (ScanMorsel*)_morsel.get();
@ -78,9 +80,14 @@ ConnectorChunkSource::~ConnectorChunkSource() {
}
Status ConnectorChunkSource::prepare(RuntimeState* state) {
// The semantics of `prepare` in ChunkSource are identical to `open`
_runtime_state = state;
RETURN_IF_ERROR(_data_source->open(state));
return Status::OK();
}
Status ConnectorChunkSource::set_finished(RuntimeState* state) {
_chunk_buffer.shutdown();
_chunk_buffer.clear();
return Status::OK();
}
@ -88,6 +95,7 @@ void ConnectorChunkSource::close(RuntimeState* state) {
if (_closed) return;
_closed = true;
_data_source->close(state);
set_finished(state);
}
bool ConnectorChunkSource::has_next_chunk() const {
@ -105,9 +113,10 @@ size_t ConnectorChunkSource::get_buffer_size() const {
}
StatusOr<vectorized::ChunkPtr> ConnectorChunkSource::get_next_chunk_from_buffer() {
vectorized::ChunkPtr chunk = nullptr;
// Will release the token after exiting this scope.
ChunkWithToken chunk = std::make_pair(nullptr, nullptr);
_chunk_buffer.try_get(&chunk);
return chunk;
return std::move(chunk.first);
}
Status ConnectorChunkSource::buffer_next_batch_chunks_blocking(size_t batch_size, RuntimeState* state) {
@ -116,16 +125,22 @@ Status ConnectorChunkSource::buffer_next_batch_chunks_blocking(size_t batch_size
}
for (size_t i = 0; i < batch_size && !state->is_cancelled(); ++i) {
if (_chunk_token == nullptr && (_chunk_token = _buffer_limiter->pin(1)) == nullptr) {
return Status::OK();
}
vectorized::ChunkPtr chunk;
_status = _read_chunk(&chunk);
if (!_status.ok()) {
// end of file is a normal case; the chunk still needs to be processed
if (_status.is_end_of_file()) {
_chunk_buffer.put(std::move(chunk));
_chunk_buffer.put(std::make_pair(std::move(chunk), std::move(_chunk_token)));
}
break;
}
_chunk_buffer.put(std::move(chunk));
if (!_chunk_buffer.put(std::make_pair(std::move(chunk), std::move(_chunk_token)))) {
break;
}
}
return _status;
}
@ -139,6 +154,10 @@ Status ConnectorChunkSource::buffer_next_batch_chunks_blocking_for_workgroup(siz
int64_t time_spent = 0;
for (size_t i = 0; i < batch_size && !state->is_cancelled(); ++i) {
{
if (_chunk_token == nullptr && (_chunk_token = _buffer_limiter->pin(1)) == nullptr) {
return Status::OK();
}
SCOPED_RAW_TIMER(&time_spent);
vectorized::ChunkPtr chunk;
@ -147,13 +166,15 @@ Status ConnectorChunkSource::buffer_next_batch_chunks_blocking_for_workgroup(siz
// end of file is a normal case; the chunk still needs to be processed
if (_status.is_end_of_file()) {
++(*num_read_chunks);
_chunk_buffer.put(std::move(chunk));
_chunk_buffer.put(std::make_pair(std::move(chunk), std::move(_chunk_token)));
}
break;
}
++(*num_read_chunks);
_chunk_buffer.put(std::move(chunk));
if (!_chunk_buffer.put(std::make_pair(std::move(chunk), std::move(_chunk_token)))) {
break;
}
}
if (time_spent >= YIELD_MAX_TIME_SPENT) {
@ -161,7 +182,8 @@ Status ConnectorChunkSource::buffer_next_batch_chunks_blocking_for_workgroup(siz
}
if (time_spent >= YIELD_PREEMPT_MAX_TIME_SPENT &&
workgroup::WorkGroupManager::instance()->get_owners_of_scan_worker(worker_id, running_wg)) {
workgroup::WorkGroupManager::instance()->get_owners_of_scan_worker(workgroup::TypeHdfsScanExecutor,
worker_id, running_wg)) {
break;
}
}
@ -171,6 +193,11 @@ Status ConnectorChunkSource::buffer_next_batch_chunks_blocking_for_workgroup(siz
Status ConnectorChunkSource::_read_chunk(vectorized::ChunkPtr* chunk) {
RuntimeState* state = _runtime_state;
if (!_opened) {
RETURN_IF_ERROR(_data_source->open(state));
_opened = true;
}
if (state->is_cancelled()) {
return Status::Cancelled("canceled state");
}
@ -183,7 +210,10 @@ Status ConnectorChunkSource::_read_chunk(vectorized::ChunkPtr* chunk) {
do {
RETURN_IF_ERROR(_data_source->get_next(state, chunk));
} while ((*chunk)->num_rows() == 0);
_rows_read += (*chunk)->num_rows();
_scan_rows_num = _data_source->raw_rows_read();
_scan_bytes = _data_source->num_bytes_read();
return Status::OK();
}

View File

@ -13,9 +13,13 @@ class ScanNode;
namespace pipeline {
class ChunkBufferToken;
using ChunkBufferTokenPtr = std::unique_ptr<ChunkBufferToken>;
class ChunkBufferLimiter;
class ConnectorScanOperatorFactory final : public ScanOperatorFactory {
public:
ConnectorScanOperatorFactory(int32_t id, ScanNode* scan_node);
ConnectorScanOperatorFactory(int32_t id, ScanNode* scan_node, ChunkBufferLimiterPtr buffer_limiter);
~ConnectorScanOperatorFactory() override = default;
@ -27,7 +31,7 @@ public:
class ConnectorScanOperator final : public ScanOperator {
public:
ConnectorScanOperator(OperatorFactory* factory, int32_t id, int32_t driver_sequence, ScanNode* scan_node,
int max_scan_concurrency, std::atomic<int>& num_committed_scan_tasks);
ChunkBufferLimiter* buffer_limiter);
~ConnectorScanOperator() override = default;
@ -41,12 +45,13 @@ private:
class ConnectorChunkSource final : public ChunkSource {
public:
ConnectorChunkSource(RuntimeProfile* runtime_profile, MorselPtr&& morsel, ScanOperator* op,
vectorized::ConnectorScanNode* scan_node);
vectorized::ConnectorScanNode* scan_node, ChunkBufferLimiter* const buffer_limiter);
~ConnectorChunkSource() override;
Status prepare(RuntimeState* state) override;
Status set_finished(RuntimeState* state) override;
void close(RuntimeState* state) override;
bool has_next_chunk() const override;
@ -63,6 +68,8 @@ public:
workgroup::WorkGroupPtr running_wg) override;
private:
using ChunkWithToken = std::pair<vectorized::ChunkPtr, ChunkBufferTokenPtr>;
Status _read_chunk(vectorized::ChunkPtr* chunk);
// Yield the scan IO task when the maximum time in nanoseconds has been spent in the current execution round.
@ -84,9 +91,13 @@ private:
// =========================
RuntimeState* _runtime_state = nullptr;
Status _status = Status::OK();
bool _opened = false;
bool _closed = false;
uint64_t _rows_read = 0;
UnboundedBlockingQueue<vectorized::ChunkPtr> _chunk_buffer;
uint64_t _bytes_read = 0;
UnboundedBlockingQueue<ChunkWithToken> _chunk_buffer;
ChunkBufferLimiter* const _buffer_limiter;
};
} // namespace pipeline

View File

@ -87,7 +87,11 @@ StatusOr<MorselPtr> PhysicalSplitMorselQueue::try_get() {
return nullptr;
}
RETURN_IF_ERROR(_init_segment());
if (auto status = _init_segment(); !status.ok()) {
// The morsel queue cannot generate morsels after an error occurs.
_tablet_idx = _tablets.size();
return status;
}
}
vectorized::SparseRange taken_range;

View File

@ -90,7 +90,8 @@ public:
virtual void set_tablets(const std::vector<TabletSharedPtr>& tablets) {}
virtual void set_tablet_rowsets(const std::vector<std::vector<RowsetSharedPtr>>& tablet_rowsets) {}
virtual size_t num_morsels() const = 0;
virtual size_t num_original_morsels() const = 0;
virtual size_t max_degree_of_parallelism() const = 0;
virtual bool empty() const = 0;
virtual StatusOr<MorselPtr> try_get() = 0;
@ -108,7 +109,8 @@ public:
std::vector<TInternalScanRange*> olap_scan_ranges() const override;
size_t num_morsels() const override { return _num_morsels; }
size_t num_original_morsels() const override { return _num_morsels; }
size_t max_degree_of_parallelism() const override { return _num_morsels; }
bool empty() const override { return _pop_index >= _num_morsels; }
StatusOr<MorselPtr> try_get() override;
@ -137,7 +139,8 @@ public:
_tablet_rowsets = tablet_rowsets;
}
size_t num_morsels() const override { return _degree_of_parallelism; }
size_t num_original_morsels() const override { return _morsels.size(); }
size_t max_degree_of_parallelism() const override { return _degree_of_parallelism; }
bool empty() const override { return _tablet_idx >= _tablets.size(); }
StatusOr<MorselPtr> try_get() override;


@ -4,6 +4,7 @@
#include "column/column_helper.h"
#include "common/constexpr.h"
#include "exec/pipeline/scan/chunk_buffer_limiter.h"
#include "exec/pipeline/scan/olap_scan_context.h"
#include "exec/pipeline/scan/scan_operator.h"
#include "exec/vectorized/olap_scan_node.h"
@ -25,23 +26,33 @@ namespace starrocks::pipeline {
using namespace vectorized;
OlapChunkSource::OlapChunkSource(RuntimeProfile* runtime_profile, MorselPtr&& morsel,
vectorized::OlapScanNode* scan_node, OlapScanContext* scan_ctx)
vectorized::OlapScanNode* scan_node, OlapScanContext* scan_ctx,
ChunkBufferLimiter* const buffer_limiter)
: ChunkSource(runtime_profile, std::move(morsel)),
_scan_node(scan_node),
_scan_ctx(scan_ctx),
_limit(scan_node->limit()),
_scan_range(down_cast<ScanMorsel*>(_morsel.get())->get_olap_scan_range()) {}
_scan_range(down_cast<ScanMorsel*>(_morsel.get())->get_olap_scan_range()),
_buffer_limiter(buffer_limiter) {}
OlapChunkSource::~OlapChunkSource() {
_reader.reset();
_predicate_free_pool.clear();
}
Status OlapChunkSource::set_finished(RuntimeState* state) {
_chunk_buffer.shutdown();
_chunk_buffer.clear();
return Status::OK();
}
void OlapChunkSource::close(RuntimeState* state) {
_update_counter();
_prj_iter->close();
_reader.reset();
_predicate_free_pool.clear();
set_finished(state);
}
Status OlapChunkSource::prepare(RuntimeState* state) {
@ -290,9 +301,10 @@ size_t OlapChunkSource::get_buffer_size() const {
}
StatusOr<vectorized::ChunkPtr> OlapChunkSource::get_next_chunk_from_buffer() {
vectorized::ChunkPtr chunk = nullptr;
// Will release the token after exiting this scope.
ChunkWithToken chunk = std::make_pair(nullptr, nullptr);
_chunk_buffer.try_get(&chunk);
return chunk;
return std::move(chunk.first);
}
Status OlapChunkSource::buffer_next_batch_chunks_blocking(size_t batch_size, RuntimeState* state) {
@ -302,18 +314,26 @@ Status OlapChunkSource::buffer_next_batch_chunks_blocking(size_t batch_size, Run
using namespace vectorized;
for (size_t i = 0; i < batch_size && !state->is_cancelled(); ++i) {
if (_chunk_token == nullptr && (_chunk_token = _buffer_limiter->pin(1)) == nullptr) {
return Status::OK();
}
ChunkUniquePtr chunk(
ChunkHelper::new_chunk_pooled(_prj_iter->output_schema(), _runtime_state->chunk_size(), true));
_status = _read_chunk_from_storage(_runtime_state, chunk.get());
if (!_status.ok()) {
// End-of-file is a normal case; the chunk still needs to be processed.
if (_status.is_end_of_file()) {
_chunk_buffer.put(std::move(chunk));
_chunk_buffer.put(std::make_pair(std::move(chunk), std::move(_chunk_token)));
}
break;
}
_chunk_buffer.put(std::move(chunk));
if (!_chunk_buffer.put(std::make_pair(std::move(chunk), std::move(_chunk_token)))) {
break;
}
}
return _status;
}
@ -323,11 +343,15 @@ Status OlapChunkSource::buffer_next_batch_chunks_blocking_for_workgroup(size_t b
if (!_status.ok()) {
return _status;
}
using namespace vectorized;
int64_t time_spent = 0;
for (size_t i = 0; i < batch_size && !state->is_cancelled(); ++i) {
{
if (_chunk_token == nullptr && (_chunk_token = _buffer_limiter->pin(1)) == nullptr) {
return Status::OK();
}
SCOPED_RAW_TIMER(&time_spent);
ChunkUniquePtr chunk(
@ -337,13 +361,15 @@ Status OlapChunkSource::buffer_next_batch_chunks_blocking_for_workgroup(size_t b
// End-of-file is a normal case; the chunk still needs to be processed.
if (_status.is_end_of_file()) {
++(*num_read_chunks);
_chunk_buffer.put(std::move(chunk));
_chunk_buffer.put(std::make_pair(std::move(chunk), std::move(_chunk_token)));
}
break;
}
++(*num_read_chunks);
_chunk_buffer.put(std::move(chunk));
if (!_chunk_buffer.put(std::make_pair(std::move(chunk), std::move(_chunk_token)))) {
break;
}
}
if (time_spent >= YIELD_MAX_TIME_SPENT) {
@ -351,7 +377,8 @@ Status OlapChunkSource::buffer_next_batch_chunks_blocking_for_workgroup(size_t b
}
if (time_spent >= YIELD_PREEMPT_MAX_TIME_SPENT &&
workgroup::WorkGroupManager::instance()->get_owners_of_scan_worker(worker_id, running_wg)) {
workgroup::WorkGroupManager::instance()->get_owners_of_scan_worker(workgroup::TypeOlapScanExecutor,
worker_id, running_wg)) {
break;
}
}
@ -416,6 +443,7 @@ Status OlapChunkSource::_read_chunk_from_storage(RuntimeState* state, vectorized
TRY_CATCH_ALLOC_SCOPE_END()
} while (chunk->num_rows() == 0);
_update_realtime_counter(chunk);
// Optimization for "select * from table limit x" where x is small.
if (_limit != -1 && _num_rows_read >= _limit) {
@ -424,26 +452,22 @@ Status OlapChunkSource::_read_chunk_from_storage(RuntimeState* state, vectorized
return Status::OK();
}
int64_t OlapChunkSource::last_spent_cpu_time_ns() {
int64_t time_ns = _last_spent_cpu_time_ns;
_last_spent_cpu_time_ns += _reader->stats().decompress_ns;
_last_spent_cpu_time_ns += _reader->stats().vec_cond_ns;
_last_spent_cpu_time_ns += _reader->stats().del_filter_ns;
return _last_spent_cpu_time_ns - time_ns;
}
void OlapChunkSource::_update_realtime_counter(vectorized::Chunk* chunk) {
COUNTER_UPDATE(_read_compressed_counter, _reader->stats().compressed_bytes_read);
_compressed_bytes_read += _reader->stats().compressed_bytes_read;
_reader->mutable_stats()->compressed_bytes_read = 0;
COUNTER_UPDATE(_raw_rows_counter, _reader->stats().raw_rows_read);
_raw_rows_read += _reader->stats().raw_rows_read;
_last_scan_rows_num += _reader->stats().raw_rows_read;
_last_scan_bytes += _reader->stats().bytes_read;
_reader->mutable_stats()->raw_rows_read = 0;
auto& stats = _reader->stats();
_num_rows_read += chunk->num_rows();
_scan_rows_num = stats.raw_rows_read;
_scan_bytes = stats.bytes_read;
_cpu_time_spent_ns = stats.decompress_ns + stats.vec_cond_ns + stats.del_filter_ns;
// Update local counters.
_local_sum_row_bytes += chunk->memory_usage();
_local_num_rows += chunk->num_rows();
_local_max_chunk_rows = std::max(_local_max_chunk_rows, chunk->num_rows());
if (_local_sum_chunks++ % UPDATE_AVG_ROW_BYTES_FREQUENCY == 0) {
_buffer_limiter->update_avg_row_bytes(_local_sum_row_bytes, _local_num_rows, _local_max_chunk_rows);
_local_sum_row_bytes = 0;
_local_num_rows = 0;
}
}
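
_update_realtime_counter above also feeds a rolling row-size estimate to the buffer limiter, flushing locally accumulated sums only every UPDATE_AVG_ROW_BYTES_FREQUENCY chunks so the shared limiter is not updated on every chunk. A self-contained sketch of that batched-estimate pattern, with illustrative names:

#include <algorithm>
#include <cstddef>

// Local sums are accumulated per chunk and flushed every N chunks, so
// the shared estimate is not recomputed on every chunk.
class RowBytesEstimator {
public:
    static constexpr size_t kFlushEvery = 8;  // UPDATE_AVG_ROW_BYTES_FREQUENCY

    // Called once per produced chunk.
    void on_chunk(size_t chunk_bytes, size_t chunk_rows) {
        _sum_bytes += chunk_bytes;
        _sum_rows += chunk_rows;
        _max_chunk_rows = std::max(_max_chunk_rows, chunk_rows);
        if (_num_chunks++ % kFlushEvery == 0) {
            flush();
        }
    }

    size_t avg_row_bytes() const { return _avg_row_bytes; }
    size_t max_chunk_rows() const { return _max_chunk_rows; }

private:
    void flush() {
        if (_sum_rows > 0) {
            _avg_row_bytes = _sum_bytes / _sum_rows;
        }
        _sum_bytes = 0;
        _sum_rows = 0;
    }

    size_t _sum_bytes = 0, _sum_rows = 0, _num_chunks = 0;
    size_t _max_chunk_rows = 0;
    size_t _avg_row_bytes = 0;
};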
void OlapChunkSource::_update_counter() {
@ -452,7 +476,6 @@ void OlapChunkSource::_update_counter() {
COUNTER_UPDATE(_io_timer, _reader->stats().io_ns);
COUNTER_UPDATE(_read_compressed_counter, _reader->stats().compressed_bytes_read);
_compressed_bytes_read += _reader->stats().compressed_bytes_read;
COUNTER_UPDATE(_decompress_timer, _reader->stats().decompress_ns);
COUNTER_UPDATE(_read_uncompressed_counter, _reader->stats().uncompressed_bytes_read);
COUNTER_UPDATE(_bytes_read_counter, _reader->stats().bytes_read);
@ -462,15 +485,11 @@ void OlapChunkSource::_update_counter() {
COUNTER_UPDATE(_block_fetch_timer, _reader->stats().block_fetch_ns);
COUNTER_UPDATE(_block_seek_timer, _reader->stats().block_seek_ns);
COUNTER_UPDATE(_raw_rows_counter, _reader->stats().raw_rows_read);
_raw_rows_read += _reader->mutable_stats()->raw_rows_read;
_last_scan_rows_num += _reader->mutable_stats()->raw_rows_read;
_last_scan_bytes += _reader->mutable_stats()->bytes_read;
COUNTER_UPDATE(_chunk_copy_timer, _reader->stats().vec_cond_chunk_copy_ns);
COUNTER_UPDATE(_seg_init_timer, _reader->stats().segment_init_ns);
COUNTER_UPDATE(_raw_rows_counter, _reader->stats().raw_rows_read);
int64_t cond_evaluate_ns = 0;
cond_evaluate_ns += _reader->stats().vec_cond_evaluate_ns;
cond_evaluate_ns += _reader->stats().branchless_cond_evaluate_ns;
@ -500,8 +519,8 @@ void OlapChunkSource::_update_counter() {
COUNTER_SET(_pushdown_predicates_counter, (int64_t)_params.predicates.size());
StarRocksMetrics::instance()->query_scan_bytes.increment(_compressed_bytes_read);
StarRocksMetrics::instance()->query_scan_rows.increment(_raw_rows_read);
StarRocksMetrics::instance()->query_scan_bytes.increment(_scan_bytes);
StarRocksMetrics::instance()->query_scan_rows.increment(_scan_rows_num);
if (_reader->stats().decode_dict_ns > 0) {
RuntimeProfile::Counter* c = ADD_TIMER(_runtime_profile, "DictDecode");


@ -29,16 +29,20 @@ namespace pipeline {
class ScanOperator;
class OlapScanContext;
class ChunkBufferToken;
using ChunkBufferTokenPtr = std::unique_ptr<ChunkBufferToken>;
class ChunkBufferLimiter;
class OlapChunkSource final : public ChunkSource {
public:
OlapChunkSource(RuntimeProfile* runtime_profile, MorselPtr&& morsel, vectorized::OlapScanNode* scan_node,
OlapScanContext* scan_ctx);
OlapScanContext* scan_ctx, ChunkBufferLimiter* const buffer_limiter);
~OlapChunkSource() override;
Status prepare(RuntimeState* state) override;
Status set_finished(RuntimeState* state) override;
void close(RuntimeState* state) override;
bool has_next_chunk() const override;
@ -54,8 +58,6 @@ public:
size_t* num_read_chunks, int worker_id,
workgroup::WorkGroupPtr running_wg) override;
int64_t last_spent_cpu_time_ns() override;
private:
// Yield the scan I/O task when the maximum time in nanoseconds has been spent in the current execution round.
static constexpr int64_t YIELD_MAX_TIME_SPENT = 100'000'000L;
@ -63,6 +65,8 @@ private:
// if it runs in the worker thread owned by other workgroup, which has running drivers.
static constexpr int64_t YIELD_PREEMPT_MAX_TIME_SPENT = 20'000'000L;
static constexpr int UPDATE_AVG_ROW_BYTES_FREQUENCY = 8;
Status _get_tablet(const TInternalScanRange* scan_range);
Status _init_reader_params(const std::vector<std::unique_ptr<OlapScanRange>>& key_ranges,
const std::vector<uint32_t>& scanner_columns, std::vector<uint32_t>& reader_columns);
@ -77,6 +81,8 @@ private:
void _decide_chunk_size();
private:
using ChunkWithToken = std::pair<vectorized::ChunkPtr, ChunkBufferTokenPtr>;
vectorized::TabletReaderParams _params{};
vectorized::OlapScanNode* _scan_node;
OlapScanContext* _scan_ctx;
@ -85,7 +91,8 @@ private:
TInternalScanRange* _scan_range;
Status _status = Status::OK();
UnboundedBlockingQueue<vectorized::ChunkPtr> _chunk_buffer;
UnboundedBlockingQueue<ChunkWithToken> _chunk_buffer;
ChunkBufferLimiter* const _buffer_limiter;
vectorized::ConjunctivePredicates _not_push_down_predicates;
std::vector<uint8_t> _selection;
@ -113,9 +120,12 @@ private:
// The following are profile measures
int64_t _num_rows_read = 0;
int64_t _raw_rows_read = 0;
int64_t _compressed_bytes_read = 0;
int64_t _last_spent_cpu_time_ns = 0;
// Local counters for row-size estimation; reset after each batch
size_t _local_sum_row_bytes = 0;
size_t _local_num_rows = 0;
size_t _local_sum_chunks = 0;
size_t _local_max_chunk_rows = 0;
RuntimeProfile::Counter* _bytes_read_counter = nullptr;
RuntimeProfile::Counter* _rows_read_counter = nullptr;


@ -3,6 +3,7 @@
#include "exec/pipeline/scan/olap_scan_operator.h"
#include "column/chunk.h"
#include "exec/pipeline/scan/chunk_buffer_limiter.h"
#include "exec/pipeline/scan/olap_chunk_source.h"
#include "exec/pipeline/scan/olap_scan_context.h"
#include "exec/vectorized/olap_scan_node.h"
@ -18,8 +19,9 @@ namespace starrocks::pipeline {
// ==================== OlapScanOperatorFactory ====================
OlapScanOperatorFactory::OlapScanOperatorFactory(int32_t id, ScanNode* scan_node, OlapScanContextPtr ctx)
: ScanOperatorFactory(id, scan_node), _ctx(std::move(ctx)) {}
OlapScanOperatorFactory::OlapScanOperatorFactory(int32_t id, ScanNode* scan_node, ChunkBufferLimiterPtr buffer_limiter,
OlapScanContextPtr ctx)
: ScanOperatorFactory(id, scan_node, std::move(buffer_limiter)), _ctx(std::move(ctx)) {}
Status OlapScanOperatorFactory::do_prepare(RuntimeState* state) {
return Status::OK();
@ -28,17 +30,14 @@ Status OlapScanOperatorFactory::do_prepare(RuntimeState* state) {
void OlapScanOperatorFactory::do_close(RuntimeState*) {}
OperatorPtr OlapScanOperatorFactory::do_create(int32_t dop, int32_t driver_sequence) {
return std::make_shared<OlapScanOperator>(this, _id, driver_sequence, _scan_node, _max_scan_concurrency,
_num_committed_scan_tasks, _ctx);
return std::make_shared<OlapScanOperator>(this, _id, driver_sequence, _scan_node, _buffer_limiter.get(), _ctx);
}
// ==================== OlapScanOperator ====================
OlapScanOperator::OlapScanOperator(OperatorFactory* factory, int32_t id, int32_t driver_sequence, ScanNode* scan_node,
int max_scan_concurrency, std::atomic<int>& num_committed_scan_tasks,
OlapScanContextPtr ctx)
: ScanOperator(factory, id, driver_sequence, scan_node, max_scan_concurrency, num_committed_scan_tasks),
_ctx(std::move(ctx)) {
ChunkBufferLimiter* buffer_limiter, OlapScanContextPtr ctx)
: ScanOperator(factory, id, driver_sequence, scan_node, buffer_limiter), _ctx(std::move(ctx)) {
_ctx->ref();
}
@ -83,7 +82,7 @@ void OlapScanOperator::do_close(RuntimeState* state) {}
ChunkSourcePtr OlapScanOperator::create_chunk_source(MorselPtr morsel, int32_t chunk_source_index) {
auto* olap_scan_node = down_cast<vectorized::OlapScanNode*>(_scan_node);
return std::make_shared<OlapChunkSource>(_chunk_source_profiles[chunk_source_index].get(), std::move(morsel),
olap_scan_node, _ctx.get());
olap_scan_node, _ctx.get(), _buffer_limiter);
}
} // namespace starrocks::pipeline


@ -18,7 +18,8 @@ using OlapScanContextPtr = std::shared_ptr<OlapScanContext>;
class OlapScanOperatorFactory final : public ScanOperatorFactory {
public:
OlapScanOperatorFactory(int32_t id, ScanNode* scan_node, OlapScanContextPtr ctx);
OlapScanOperatorFactory(int32_t id, ScanNode* scan_node, ChunkBufferLimiterPtr buffer_limiter,
OlapScanContextPtr ctx);
~OlapScanOperatorFactory() override = default;
@ -33,7 +34,7 @@ private:
class OlapScanOperator final : public ScanOperator {
public:
OlapScanOperator(OperatorFactory* factory, int32_t id, int32_t driver_sequence, ScanNode* scan_node,
int max_scan_concurrency, std::atomic<int>& num_committed_scan_tasks, OlapScanContextPtr ctx);
ChunkBufferLimiter* buffer_limiter, OlapScanContextPtr ctx);
~OlapScanOperator() override;


@ -3,8 +3,10 @@
#include "exec/pipeline/scan/scan_operator.h"
#include "column/chunk.h"
#include "exec/pipeline/chunk_accumulate_operator.h"
#include "exec/pipeline/limit_operator.h"
#include "exec/pipeline/pipeline_builder.h"
#include "exec/pipeline/scan/chunk_buffer_limiter.h"
#include "exec/pipeline/scan/connector_scan_operator.h"
#include "exec/vectorized/olap_scan_node.h"
#include "exec/workgroup/scan_executor.h"
@ -17,12 +19,11 @@ namespace starrocks::pipeline {
// ========== ScanOperator ==========
ScanOperator::ScanOperator(OperatorFactory* factory, int32_t id, int32_t driver_sequence, ScanNode* scan_node,
int max_scan_concurrency, std::atomic<int>& num_committed_scan_tasks)
ChunkBufferLimiter* buffer_limiter)
: SourceOperator(factory, id, scan_node->name(), scan_node->id(), driver_sequence),
_scan_node(scan_node),
_chunk_source_profiles(MAX_IO_TASKS_PER_OP),
_max_scan_concurrency(max_scan_concurrency),
_num_committed_scan_tasks(num_committed_scan_tasks),
_buffer_limiter(buffer_limiter),
_is_io_task_running(MAX_IO_TASKS_PER_OP),
_chunk_sources(MAX_IO_TASKS_PER_OP) {
for (auto i = 0; i < MAX_IO_TASKS_PER_OP; i++) {
@ -48,8 +49,8 @@ Status ScanOperator::prepare(RuntimeState* state) {
RETURN_IF_ERROR(SourceOperator::prepare(state));
_unique_metrics->add_info_string("MorselQueueType", _morsel_queue->name());
auto* max_scan_concurrency_counter = ADD_COUNTER(_unique_metrics, "MaxScanConcurrency", TUnit::UNIT);
COUNTER_SET(max_scan_concurrency_counter, static_cast<int64_t>(_max_scan_concurrency));
_peak_buffer_size_counter = _unique_metrics->AddHighWaterMarkCounter("PeakChunkBufferSize", TUnit::UNIT);
_morsels_counter = ADD_COUNTER(_unique_metrics, "MorselsCount", TUnit::UNIT);
if (_workgroup == nullptr) {
DCHECK(_io_threads != nullptr);
@ -73,13 +74,25 @@ void ScanOperator::close(RuntimeState* state) {
}
// For running I/O tasks, we close their chunk sources in ~ScanOperator, not in ScanOperator::close.
for (size_t i = 0; i < _chunk_sources.size(); i++) {
if (_chunk_sources[i] != nullptr && !_is_io_task_running[i]) {
_chunk_sources[i]->close(state);
_chunk_sources[i] = nullptr;
if (_chunk_sources[i] != nullptr) {
_chunk_sources[i]->set_finished(state);
if (!_is_io_task_running[i]) {
_chunk_sources[i]->close(state);
_chunk_sources[i] = nullptr;
}
}
}
_default_buffer_capacity_counter = ADD_COUNTER(_unique_metrics, "DefaultChunkBufferCapacity", TUnit::UNIT);
COUNTER_SET(_default_buffer_capacity_counter, static_cast<int64_t>(_buffer_limiter->default_capacity()));
_buffer_capacity_counter = ADD_COUNTER(_unique_metrics, "ChunkBufferCapacity", TUnit::UNIT);
COUNTER_SET(_buffer_capacity_counter, static_cast<int64_t>(_buffer_limiter->capacity()));
_tablets_counter = ADD_COUNTER(_unique_metrics, "TabletCount", TUnit::UNIT);
COUNTER_SET(_tablets_counter, static_cast<int64_t>(_morsel_queue->num_original_morsels()));
_merge_chunk_source_profiles();
do_close(state);
Operator::close(state);
}
@ -99,8 +112,7 @@ bool ScanOperator::has_output() const {
}
}
if (_num_running_io_tasks >= MAX_IO_TASKS_PER_OP ||
_exceed_max_scan_concurrency(_num_committed_scan_tasks.load())) {
if (_num_running_io_tasks >= MAX_IO_TASKS_PER_OP || _buffer_limiter->is_full()) {
return false;
}
@ -159,6 +171,9 @@ Status ScanOperator::set_finishing(RuntimeState* state) {
StatusOr<vectorized::ChunkPtr> ScanOperator::pull_chunk(RuntimeState* state) {
RETURN_IF_ERROR(_get_scan_status());
_peak_buffer_size_counter->set(_buffer_limiter->size());
RETURN_IF_ERROR(_try_to_trigger_next_scan(state));
if (_workgroup != nullptr) {
_workgroup->incr_period_ask_chunk_num(1);
@ -190,19 +205,17 @@ Status ScanOperator::_try_to_trigger_next_scan(RuntimeState* state) {
return Status::OK();
}
// First, find the picked-up morsels that can commit an I/O task.
for (int i = 0; i < MAX_IO_TASKS_PER_OP; ++i) {
if (_chunk_sources[i] != nullptr && !_is_io_task_running[i] && _chunk_sources[i]->has_next_chunk()) {
RETURN_IF_ERROR(_trigger_next_scan(state, i));
if (_is_io_task_running[i]) {
continue;
}
}
// Second, find an unused slot in _chunk_sources to pick up a new morsel.
if (!_morsel_queue->empty()) {
for (int i = 0; i < MAX_IO_TASKS_PER_OP; ++i) {
if (_chunk_sources[i] == nullptr || (!_is_io_task_running[i] && !_chunk_sources[i]->has_output())) {
RETURN_IF_ERROR(_pickup_morsel(state, i));
}
if (_chunk_sources[i] == nullptr) {
RETURN_IF_ERROR(_pickup_morsel(state, i));
} else if (_chunk_sources[i]->has_next_chunk()) {
RETURN_IF_ERROR(_trigger_next_scan(state, i));
} else if (!_chunk_sources[i]->has_output()) {
RETURN_IF_ERROR(_pickup_morsel(state, i));
}
}
@ -218,11 +231,22 @@ inline bool is_uninitialized(const std::weak_ptr<QueryContext>& ptr) {
return !ptr.owner_before(wp{}) && !wp{}.owner_before(ptr);
}
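
is_uninitialized() distinguishes a weak_ptr that was never assigned from one that was assigned and has merely expired: an expired weak_ptr still references its control block, so owner_before still orders it apart from the empty weak_ptr. A standalone demonstration of the trick:

#include <cassert>
#include <memory>

// A default-constructed weak_ptr shares ownership with nothing, so it is
// owner-equivalent to the empty weak_ptr{}; a weak_ptr that was ever
// assigned from a shared_ptr is not, even after it expires.
template <typename T>
bool never_assigned(const std::weak_ptr<T>& p) {
    using wp = std::weak_ptr<T>;
    return !p.owner_before(wp{}) && !wp{}.owner_before(p);
}

int main() {
    std::weak_ptr<int> w;
    assert(never_assigned(w));
    {
        auto s = std::make_shared<int>(42);
        w = s;
    }                            // s destroyed: w is expired...
    assert(w.expired());
    assert(!never_assigned(w));  // ...but it was assigned once
}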
void ScanOperator::_finish_chunk_source_task(RuntimeState* state, int chunk_source_index, int64_t cpu_time_ns,
int64_t scan_rows, int64_t scan_bytes) {
_last_growth_cpu_time_ns += cpu_time_ns;
_last_scan_rows_num += scan_rows;
_last_scan_bytes += scan_bytes;
_num_running_io_tasks--;
_is_io_task_running[chunk_source_index] = false;
}
Status ScanOperator::_trigger_next_scan(RuntimeState* state, int chunk_source_index) {
if (!_try_to_increase_committed_scan_tasks()) {
ChunkBufferTokenPtr buffer_token;
if (buffer_token = _buffer_limiter->pin(1); buffer_token == nullptr) {
return Status::OK();
}
_chunk_sources[chunk_source_index]->pin_chunk_token(std::move(buffer_token));
_num_running_io_tasks++;
_is_io_task_running[chunk_source_index] = true;
@ -231,33 +255,39 @@ Status ScanOperator::_trigger_next_scan(RuntimeState* state, int chunk_source_in
if (is_uninitialized(_query_ctx)) {
_query_ctx = state->exec_env()->query_context_mgr()->get(state->query_id());
}
int32_t driver_id = CurrentThread::current().get_driver_id();
if (_workgroup != nullptr) {
workgroup::ScanTask task = workgroup::ScanTask(_workgroup, [wp = _query_ctx, this, state,
chunk_source_index](int worker_id) {
if (auto sp = wp.lock()) {
{
SCOPED_THREAD_LOCAL_MEM_TRACKER_SETTER(state->instance_mem_tracker());
size_t num_read_chunks = 0;
Status status = _chunk_sources[chunk_source_index]->buffer_next_batch_chunks_blocking_for_workgroup(
_buffer_size, state, &num_read_chunks, worker_id, _workgroup);
if (!status.ok() && !status.is_end_of_file()) {
_set_scan_status(status);
workgroup::ScanTask task = workgroup::ScanTask(
_workgroup, [wp = _query_ctx, this, state, chunk_source_index, driver_id](int worker_id) {
if (auto sp = wp.lock()) {
// Set driver_id here to share some driver-local contents.
// Currently it is used by ExprContext's driver-local state.
CurrentThread::current().set_pipeline_driver_id(driver_id);
DeferOp defer([]() { CurrentThread::current().set_pipeline_driver_id(0); });
SCOPED_THREAD_LOCAL_MEM_TRACKER_SETTER(state->instance_mem_tracker());
auto& chunk_source = _chunk_sources[chunk_source_index];
size_t num_read_chunks = 0;
int64_t prev_cpu_time = chunk_source->get_cpu_time_spent();
int64_t prev_scan_rows = chunk_source->get_scan_rows();
int64_t prev_scan_bytes = chunk_source->get_scan_bytes();
// Read chunk
Status status = chunk_source->buffer_next_batch_chunks_blocking_for_workgroup(
_buffer_size, state, &num_read_chunks, worker_id, _workgroup);
if (!status.ok() && !status.is_end_of_file()) {
_set_scan_status(status);
}
int64_t delta_cpu_time = chunk_source->get_cpu_time_spent() - prev_cpu_time;
_workgroup->increment_real_runtime_ns(delta_cpu_time);
_workgroup->incr_period_scaned_chunk_num(num_read_chunks);
_finish_chunk_source_task(state, chunk_source_index, delta_cpu_time,
chunk_source->get_scan_rows() - prev_scan_rows,
chunk_source->get_scan_bytes() - prev_scan_bytes);
}
// TODO (by laotan332): More detailed information is needed
_workgroup->incr_period_scaned_chunk_num(num_read_chunks);
_workgroup->increment_real_runtime_ns(_chunk_sources[chunk_source_index]->last_spent_cpu_time_ns());
_last_growth_cpu_time_ns += _chunk_sources[chunk_source_index]->last_spent_cpu_time_ns();
_last_scan_rows_num += _chunk_sources[chunk_source_index]->last_scan_rows_num();
_last_scan_bytes += _chunk_sources[chunk_source_index]->last_scan_bytes();
}
_decrease_committed_scan_tasks();
_num_running_io_tasks--;
_is_io_task_running[chunk_source_index] = false;
}
});
});
if (dynamic_cast<ConnectorScanOperator*>(this) != nullptr) {
offer_task_success = ExecEnv::GetInstance()->hdfs_scan_executor()->submit(std::move(task));
} else {
@ -265,23 +295,30 @@ Status ScanOperator::_trigger_next_scan(RuntimeState* state, int chunk_source_in
}
} else {
PriorityThreadPool::Task task;
task.work_function = [wp = _query_ctx, this, state, chunk_source_index]() {
task.work_function = [wp = _query_ctx, this, state, chunk_source_index, driver_id]() {
if (auto sp = wp.lock()) {
{
SCOPED_THREAD_LOCAL_MEM_TRACKER_SETTER(state->instance_mem_tracker());
Status status =
_chunk_sources[chunk_source_index]->buffer_next_batch_chunks_blocking(_buffer_size, state);
if (!status.ok() && !status.is_end_of_file()) {
_set_scan_status(status);
}
_last_growth_cpu_time_ns += _chunk_sources[chunk_source_index]->last_spent_cpu_time_ns();
_last_scan_rows_num += _chunk_sources[chunk_source_index]->last_scan_rows_num();
_last_scan_bytes += _chunk_sources[chunk_source_index]->last_scan_bytes();
}
// Set driver_id here to share some driver-local contents.
// Currently it is used by ExprContext's driver-local state.
CurrentThread::current().set_pipeline_driver_id(driver_id);
DeferOp defer([]() { CurrentThread::current().set_pipeline_driver_id(0); });
_decrease_committed_scan_tasks();
_num_running_io_tasks--;
_is_io_task_running[chunk_source_index] = false;
SCOPED_THREAD_LOCAL_MEM_TRACKER_SETTER(state->instance_mem_tracker());
auto& chunk_source = _chunk_sources[chunk_source_index];
int64_t prev_cpu_time = chunk_source->get_cpu_time_spent();
int64_t prev_scan_rows = chunk_source->get_scan_rows();
int64_t prev_scan_bytes = chunk_source->get_scan_bytes();
Status status =
_chunk_sources[chunk_source_index]->buffer_next_batch_chunks_blocking(_buffer_size, state);
if (!status.ok() && !status.is_end_of_file()) {
_set_scan_status(status);
}
int64_t delta_cpu_time = chunk_source->get_cpu_time_spent() - prev_cpu_time;
_finish_chunk_source_task(state, chunk_source_index, delta_cpu_time,
chunk_source->get_scan_rows() - prev_scan_rows,
chunk_source->get_scan_bytes() - prev_scan_bytes);
}
};
// TODO(by satanson): set a proper priority
@ -293,6 +330,7 @@ Status ScanOperator::_trigger_next_scan(RuntimeState* state, int chunk_source_in
if (offer_task_success) {
_io_task_retry_cnt = 0;
} else {
_chunk_sources[chunk_source_index]->unpin_chunk_token();
_num_running_io_tasks--;
_is_io_task_running[chunk_source_index] = false;
// TODO(hcf) set a proper retry times
@ -315,6 +353,8 @@ Status ScanOperator::_pickup_morsel(RuntimeState* state, int chunk_source_index)
ASSIGN_OR_RETURN(auto morsel, _morsel_queue->try_get());
if (morsel != nullptr) {
COUNTER_UPDATE(_morsels_counter, 1);
_chunk_sources[chunk_source_index] = create_chunk_source(std::move(morsel), chunk_source_index);
auto status = _chunk_sources[chunk_source_index]->prepare(state);
if (!status.ok()) {
@ -341,26 +381,12 @@ void ScanOperator::_merge_chunk_source_profiles() {
_unique_metrics->copy_all_counters_from(merged_profile);
}
bool ScanOperator::_try_to_increase_committed_scan_tasks() {
int old_num = _num_committed_scan_tasks.fetch_add(1);
if (_exceed_max_scan_concurrency(old_num)) {
_decrease_committed_scan_tasks();
return false;
}
return true;
}
bool ScanOperator::_exceed_max_scan_concurrency(int num_committed_scan_tasks) const {
// _max_scan_concurrency takes effect, only when it is positive.
return _max_scan_concurrency > 0 && num_committed_scan_tasks >= _max_scan_concurrency;
}
// ========== ScanOperatorFactory ==========
ScanOperatorFactory::ScanOperatorFactory(int32_t id, ScanNode* scan_node)
ScanOperatorFactory::ScanOperatorFactory(int32_t id, ScanNode* scan_node, ChunkBufferLimiterPtr buffer_limiter)
: SourceOperatorFactory(id, scan_node->name(), scan_node->id()),
_scan_node(scan_node),
_max_scan_concurrency(scan_node->max_scan_concurrency()) {}
_buffer_limiter(std::move(buffer_limiter)) {}
Status ScanOperatorFactory::prepare(RuntimeState* state) {
RETURN_IF_ERROR(OperatorFactory::prepare(state));
@ -385,16 +411,16 @@ pipeline::OpFactories decompose_scan_node_to_pipeline(std::shared_ptr<ScanOperat
ScanNode* scan_node, pipeline::PipelineBuilderContext* context) {
OpFactories ops;
const auto* morsel_queue = context->morsel_queue_of_source_operator(scan_operator.get());
// ScanOperator's degree_of_parallelism is not more than the number of morsels
// If table is empty, then morsel size is zero and we still set degree of parallelism to 1
const auto degree_of_parallelism =
std::min<size_t>(std::max<size_t>(1, morsel_queue->num_morsels()), context->degree_of_parallelism());
scan_operator->set_degree_of_parallelism(degree_of_parallelism);
size_t scan_dop = context->degree_of_parallelism_of_source_operator(scan_operator.get());
scan_operator->set_degree_of_parallelism(scan_dop);
ops.emplace_back(std::move(scan_operator));
if (!scan_node->conjunct_ctxs().empty() || ops.back()->has_runtime_filters()) {
ops.emplace_back(
std::make_shared<ChunkAccumulateOperatorFactory>(context->next_operator_id(), scan_node->id()));
}
size_t limit = scan_node->limit();
if (limit != -1) {
ops.emplace_back(


@ -13,13 +13,20 @@ class ScanNode;
namespace pipeline {
class ChunkBufferLimiter;
using ChunkBufferLimiterPtr = std::unique_ptr<ChunkBufferLimiter>;
class ScanOperator : public SourceOperator {
public:
static constexpr int MAX_IO_TASKS_PER_OP = 4;
ScanOperator(OperatorFactory* factory, int32_t id, int32_t driver_sequence, ScanNode* scan_node,
int max_scan_concurrency, std::atomic<int>& num_committed_scan_tasks);
ChunkBufferLimiter* buffer_limiter);
~ScanOperator() override;
static size_t max_buffer_capacity() { return config::pipeline_io_buffer_size; }
Status prepare(RuntimeState* state) override;
// The running I/O task committed by ScanOperator holds the reference of query context,
@ -49,17 +56,8 @@ public:
virtual void do_close(RuntimeState* state) = 0;
virtual ChunkSourcePtr create_chunk_source(MorselPtr morsel, int32_t chunk_source_index) = 0;
virtual int64_t get_last_scan_rows_num() {
int64_t scan_rows_num = _last_scan_rows_num;
_last_scan_rows_num = 0;
return scan_rows_num;
}
virtual int64_t get_last_scan_bytes() {
int64_t res = _last_scan_bytes;
_last_scan_bytes = 0;
return res;
}
int64_t get_last_scan_rows_num() { return _last_scan_rows_num.exchange(0); }
int64_t get_last_scan_bytes() { return _last_scan_bytes.exchange(0); }
private:
// This method is only invoked when the current morsel has reached EOF.
@ -67,6 +65,8 @@ private:
Status _pickup_morsel(RuntimeState* state, int chunk_source_index);
Status _trigger_next_scan(RuntimeState* state, int chunk_source_index);
Status _try_to_trigger_next_scan(RuntimeState* state);
void _finish_chunk_source_task(RuntimeState* state, int chunk_source_index, int64_t cpu_time_ns, int64_t scan_rows,
int64_t scan_bytes);
void _merge_chunk_source_profiles();
inline void _set_scan_status(const Status& status) {
@ -81,10 +81,6 @@ private:
return _scan_status;
}
bool _try_to_increase_committed_scan_tasks();
void _decrease_committed_scan_tasks() { _num_committed_scan_tasks.fetch_sub(1); }
bool _exceed_max_scan_concurrency(int num_committed_scan_tasks) const;
protected:
ScanNode* _scan_node = nullptr;
// ScanOperator may do parallel scan, so each _chunk_sources[i] needs to hold
@ -95,14 +91,10 @@ protected:
std::vector<std::shared_ptr<RuntimeProfile>> _chunk_source_profiles;
bool _is_finished = false;
const int _max_scan_concurrency;
// Shared by all the ScanOperators created by the same ScanOperatorFactory.
std::atomic<int>& _num_committed_scan_tasks;
// Shared among scan operators decomposed from a scan node, and owned by ScanOperatorFactory.
ChunkBufferLimiter* _buffer_limiter;
private:
static constexpr int MAX_IO_TASKS_PER_OP = 4;
const size_t _buffer_size = config::pipeline_io_buffer_size;
int32_t _io_task_retry_cnt = 0;
@ -118,11 +110,20 @@ private:
workgroup::WorkGroupPtr _workgroup = nullptr;
std::atomic_int64_t _last_scan_rows_num = 0;
std::atomic_int64_t _last_scan_bytes = 0;
RuntimeProfile::Counter* _default_buffer_capacity_counter = nullptr;
RuntimeProfile::Counter* _buffer_capacity_counter = nullptr;
RuntimeProfile::HighWaterMarkCounter* _peak_buffer_size_counter = nullptr;
// The total number of the original tablets in this fragment instance.
RuntimeProfile::Counter* _tablets_counter = nullptr;
// The number of morsels picked up by this scan operator.
// A tablet may be divided into multiple morsels.
RuntimeProfile::Counter* _morsels_counter = nullptr;
};
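
The get_last_scan_rows_num()/get_last_scan_bytes() getters above replace a load-then-store pair with a single exchange(0): with concurrent I/O tasks incrementing the counters, an increment landing between the load and the store would be silently lost. A small sketch contrasting the two, with illustrative names:

#include <atomic>
#include <cstdint>

// Read-and-reset must be one atomic operation; exchange(0) returns the
// old value and zeroes the counter without losing concurrent increments.
struct ScanCounters {
    std::atomic<int64_t> rows{0};

    void add(int64_t n) { rows.fetch_add(n); }

    // Racy version: another thread can add() between the load and the
    // store, and that increment is silently dropped.
    int64_t take_racy() {
        int64_t v = rows.load();
        rows.store(0);
        return v;
    }

    // Atomic version, as in get_last_scan_rows_num() above.
    int64_t take() { return rows.exchange(0); }
};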
class ScanOperatorFactory : public SourceOperatorFactory {
public:
ScanOperatorFactory(int32_t id, ScanNode* scan_node);
ScanOperatorFactory(int32_t id, ScanNode* scan_node, ChunkBufferLimiterPtr buffer_limiter);
~ScanOperatorFactory() override = default;
@ -140,9 +141,7 @@ public:
protected:
ScanNode* const _scan_node;
const int _max_scan_concurrency;
std::atomic<int> _num_committed_scan_tasks{0};
ChunkBufferLimiterPtr _buffer_limiter;
};
pipeline::OpFactories decompose_scan_node_to_pipeline(std::shared_ptr<ScanOperatorFactory> factory, ScanNode* scan_node,


@ -12,6 +12,8 @@ Status SelectOperator::prepare(RuntimeState* state) {
}
void SelectOperator::close(RuntimeState* state) {
_curr_chunk.reset();
_pre_output_chunk.reset();
Operator::close(state);
}


@ -21,6 +21,7 @@ Status ExceptContext::prepare(RuntimeState* state, const std::vector<ExprContext
}
void ExceptContext::close(RuntimeState* state) {
_hash_set.reset();
if (_build_pool != nullptr) {
_build_pool->free_all();
}


@ -29,10 +29,12 @@ class ExceptContext final : public ContextWithDependency {
public:
explicit ExceptContext(const int dst_tuple_id) : _dst_tuple_id(dst_tuple_id) {}
bool is_ht_empty() const { return _hash_set->empty(); }
bool is_ht_empty() const { return _is_hash_set_empty; }
void finish_build_ht() {
_is_hash_set_empty = _hash_set->empty();
_next_processed_iter = _hash_set->begin();
_hash_set_end_iter = _hash_set->end();
_finished_dependency_index.fetch_add(1, std::memory_order_release);
}
@ -42,7 +44,7 @@ public:
return _finished_dependency_index.load(std::memory_order_acquire) == dependency_index;
}
bool is_output_finished() const { return _next_processed_iter == _hash_set->end(); }
bool is_output_finished() const { return _next_processed_iter == _hash_set_end_iter; }
// Called in the preparation phase of ExceptBuildSinkOperator.
Status prepare(RuntimeState* state, const std::vector<ExprContext*>& build_exprs);
@ -77,6 +79,8 @@ private:
// Used to traverse the hash set and emit the remaining (undeleted) keys to the dest chunk.
// Initialized when the hash set finishes building in finish_build_ht().
vectorized::ExceptHashSerializeSet::Iterator _next_processed_iter;
vectorized::ExceptHashSerializeSet::Iterator _hash_set_end_iter;
bool _is_hash_set_empty = false;
// The BUILD, PROBES, and OUTPUT operators execute sequentially.
// BUILD -> 1-th PROBE -> 2-th PROBE -> ... -> n-th PROBE -> OUTPUT.
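
finish_build_ht() now snapshots the hash set's emptiness and end iterator, so is_ht_empty() and is_output_finished() never call into the hash set itself, which close() may reset on another path (IntersectContext below gets the same treatment). A rough sketch of the snapshot, using std::unordered_set in place of ExceptHashSerializeSet:

#include <string>
#include <unordered_set>

// The output phase compares against a cached end iterator and emptiness
// flag instead of calling methods on the hash set, so resetting the set
// elsewhere cannot race with is_output_finished().
struct BuildOutputContext {
    std::unordered_set<std::string> hash_set;
    std::unordered_set<std::string>::iterator next_iter;
    std::unordered_set<std::string>::iterator end_iter;
    bool is_empty = false;

    void finish_build() {
        is_empty = hash_set.empty();
        next_iter = hash_set.begin();
        end_iter = hash_set.end();  // cached once, never re-fetched
    }

    bool ht_empty() const { return is_empty; }
    bool output_finished() const { return next_iter == end_iter; }
};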


@ -20,6 +20,7 @@ Status IntersectContext::prepare(RuntimeState* state, const std::vector<ExprCont
}
void IntersectContext::close(RuntimeState* state) {
_hash_set.reset();
if (_build_pool != nullptr) {
_build_pool->free_all();
}


@ -31,10 +31,12 @@ public:
IntersectContext(const int dst_tuple_id, const size_t intersect_times)
: _dst_tuple_id(dst_tuple_id), _intersect_times(intersect_times) {}
bool is_ht_empty() const { return _hash_set->empty(); }
bool is_ht_empty() const { return _is_hash_set_empty; }
void finish_build_ht() {
_is_hash_set_empty = _hash_set->empty();
_next_processed_iter = _hash_set->begin();
_hash_set_end_iter = _hash_set->end();
_finished_dependency_index.fetch_add(1, std::memory_order_release);
}
@ -44,7 +46,7 @@ public:
return _finished_dependency_index.load(std::memory_order_acquire) == dependency_index;
}
bool is_output_finished() const { return _next_processed_iter == _hash_set->end(); }
bool is_output_finished() const { return _next_processed_iter == _hash_set_end_iter; }
// Called in the preparation phase of IntersectBuildSinkOperator.
Status prepare(RuntimeState* state, const std::vector<ExprContext*>& build_exprs);
@ -80,6 +82,8 @@ private:
// Used to traverse the hash set and emit the remaining (undeleted) keys to the dest chunk.
// Initialized when the hash set finishes building in finish_build_ht().
vectorized::IntersectHashSerializeSet::Iterator _next_processed_iter;
vectorized::IntersectHashSerializeSet::Iterator _hash_set_end_iter;
bool _is_hash_set_empty = false;
// The BUILD, PROBES, and OUTPUT operators execute sequentially.
// BUILD -> 1-th PROBE -> 2-th PROBE -> ... -> n-th PROBE -> OUTPUT.


@ -25,6 +25,7 @@ Status PartitionSortSinkOperator::prepare(RuntimeState* state) {
void PartitionSortSinkOperator::close(RuntimeState* state) {
_sort_context->unref(state);
_chunks_sorter.reset();
Operator::close(state);
}


@ -13,6 +13,11 @@ using vectorized::Columns;
using vectorized::SortedRun;
using vectorized::SortedRuns;
void SortContext::close(RuntimeState* state) {
_chunks_sorter_partions.clear();
_merged_runs.clear();
}
StatusOr<ChunkPtr> SortContext::pull_chunk() {
if (!_is_merge_finish) {
_merge_inputs();
@ -29,7 +34,9 @@ StatusOr<ChunkPtr> SortContext::pull_chunk() {
SortedRun& run = _merged_runs.front();
ChunkPtr res = run.steal_chunk(required_rows);
RETURN_IF_ERROR(res->downgrade());
if (res != nullptr) {
RETURN_IF_ERROR(res->downgrade());
}
if (run.empty()) {
_merged_runs.pop_front();


@ -36,7 +36,7 @@ public:
_chunks_sorter_partions.reserve(num_right_sinkers);
}
void close(RuntimeState* state) override {}
void close(RuntimeState* state) override;
void add_partition_chunks_sorter(std::shared_ptr<ChunksSorter> chunks_sorter) {
_chunks_sorter_partions.push_back(chunks_sorter);


@ -92,9 +92,6 @@ public:
const std::string& name() const { return _name; }
// Used by pipeline, 0 means there is no limitation.
virtual int max_scan_concurrency() const { return 0; }
protected:
RuntimeProfile::Counter* _bytes_read_counter; // # bytes read from the scanner
// # rows/tuples read from the scanner (including those discarded by eval_conjucts())


@ -133,7 +133,7 @@ Status NodeChannel::init(RuntimeState* state) {
return Status::OK();
}
void NodeChannel::open() {
void NodeChannel::try_open() {
PTabletWriterOpenRequest request;
request.set_allocated_id(&_parent->_load_id);
request.set_index_id(_index_id);
@ -179,6 +179,15 @@ void NodeChannel::open() {
request.release_schema();
}
bool NodeChannel::is_open_done() {
if (_open_closure != nullptr) {
// open request already finished
return (_open_closure->count() != 2);
}
return true;
}
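
is_open_done() reads the closure's reference count: by convention one reference belongs to the caller and one to the in-flight RPC, so count() == 2 means the open request is still outstanding. A minimal sketch of that convention; RpcClosure here is hypothetical, not the actual ReusableClosure API:

#include <atomic>

// While the RPC is outstanding count() == 2 (caller ref + RPC ref);
// the completion callback drops it back to 1.
class RpcClosure {
public:
    void start() { _refs.store(2); }          // caller ref + RPC ref
    void on_rpc_done() { _refs.fetch_sub(1); }
    int count() const { return _refs.load(); }
    bool done() const { return count() != 2; }

private:
    std::atomic<int> _refs{1};
};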
Status NodeChannel::open_wait() {
_open_closure->join();
if (_open_closure->cntl.Failed()) {
@ -207,7 +216,11 @@ Status NodeChannel::_serialize_chunk(const vectorized::Chunk* src, ChunkPB* dst)
{
SCOPED_RAW_TIMER(&_serialize_batch_ns);
StatusOr<ChunkPB> res = serde::ProtobufChunkSerde::serialize(*src);
if (!res.ok()) return res.status();
if (!res.ok()) {
_cancelled = true;
_err_st = res.status();
return _err_st;
}
res->Swap(dst);
}
DCHECK(dst->has_uncompressed_size());
@ -216,12 +229,14 @@ Status NodeChannel::_serialize_chunk(const vectorized::Chunk* src, ChunkPB* dst)
size_t uncompressed_size = dst->uncompressed_size();
if (_compress_codec != nullptr && _compress_codec->exceed_max_input_size(uncompressed_size)) {
return Status::InternalError(fmt::format("The input size for compression should be less than {}",
_compress_codec->max_input_size()));
_cancelled = true;
_err_st = Status::InternalError(fmt::format("The input size for compression should be less than {}",
_compress_codec->max_input_size()));
return _err_st;
}
// try compress the ChunkPB data
if (_compress_codec != nullptr && uncompressed_size > 0) {
if (config::table_sink_compression_enable && _compress_codec != nullptr && uncompressed_size > 0) {
SCOPED_TIMER(_parent->_compress_timer);
// Try compressing data to _compression_scratch, swap if compressed data is smaller
@ -246,6 +261,15 @@ Status NodeChannel::_serialize_chunk(const vectorized::Chunk* src, ChunkPB* dst)
return Status::OK();
}
bool NodeChannel::is_full() {
if (_chunk_queue.size() >= _max_chunk_queue_size || _mem_tracker->limit()) {
if (!_check_prev_request_done()) {
return true;
}
}
return false;
}
Status NodeChannel::add_chunk(vectorized::Chunk* input, const int64_t* tablet_ids, const uint32_t* indexes,
uint32_t from, uint32_t size, bool eos) {
if (_cancelled || _send_finished) {
@ -258,6 +282,12 @@ Status NodeChannel::add_chunk(vectorized::Chunk* input, const int64_t* tablet_id
_cur_chunk = input->clone_empty_with_slot();
}
if (is_full()) {
// Wait until a previous request is done, so we can pop data from the queue to send a new request
// and make room to push more data.
RETURN_IF_ERROR(_wait_one_prev_request());
}
// 1. append data
_cur_chunk->append_selective(*input, indexes, from, size);
for (size_t i = 0; i < size; ++i) {
@ -280,13 +310,8 @@ Status NodeChannel::add_chunk(vectorized::Chunk* input, const int64_t* tablet_id
// 4. check last request
if (!_check_prev_request_done()) {
if (_chunk_queue.size() > _max_chunk_queue_size || _mem_tracker->limit()) {
// 4.1 wait if queue full
RETURN_IF_ERROR(_wait_one_prev_request());
} else {
// 4.2 noblock here so that channel cant send data
return Status::OK();
}
// 4.1 Do not block here, so that other node channels can send data.
return Status::OK();
}
} else {
@ -418,6 +443,20 @@ bool NodeChannel::_check_prev_request_done() {
return false;
}
bool NodeChannel::_check_all_prev_request_done() {
if (UNLIKELY(_next_packet_seq == 0)) {
return true;
}
for (size_t i = 0; i < _max_parallel_request_size; i++) {
if (_add_batch_closures[i]->count() != 1) {
return false;
}
}
return true;
}
Status NodeChannel::_wait_one_prev_request() {
SCOPED_TIMER(_parent->_wait_response_timer);
if (_next_packet_seq == 0) {
@ -448,6 +487,27 @@ Status NodeChannel::_wait_one_prev_request() {
return Status::OK();
}
Status NodeChannel::try_close() {
if (_cancelled || _send_finished) {
return _err_st;
}
if (_check_prev_request_done()) {
auto st = add_chunk(nullptr, nullptr, nullptr, 0, 0, true);
if (!st.ok()) {
_cancelled = true;
_err_st = st;
return _err_st;
}
}
return Status::OK();
}
bool NodeChannel::is_close_done() {
return (_send_finished && _check_all_prev_request_done()) || _cancelled;
}
Status NodeChannel::close_wait(RuntimeState* state) {
if (_cancelled) {
return _err_st;
@ -470,8 +530,11 @@ Status NodeChannel::close_wait(RuntimeState* state) {
}
void NodeChannel::cancel(const Status& err_st) {
// We don't need to wait for the last RPC to finish, because the closure's release/reset will join.
// But do we need brpc::StartCancel(call_id)?
// Cancel in-flight RPC requests to accelerate the release of related resources.
for (auto closure : _add_batch_closures) {
closure->cancel();
}
_cancelled = true;
_err_st = err_st;
@ -577,14 +640,28 @@ Status OlapTableSink::init(const TDataSink& t_sink) {
}
Status OlapTableSink::prepare(RuntimeState* state) {
// profile must add to state's object pool
_profile = state->obj_pool()->add(new RuntimeProfile("OlapTableSink"));
// add all counter
_input_rows_counter = ADD_COUNTER(_profile, "RowsRead", TUnit::UNIT);
_output_rows_counter = ADD_COUNTER(_profile, "RowsReturned", TUnit::UNIT);
_filtered_rows_counter = ADD_COUNTER(_profile, "RowsFiltered", TUnit::UNIT);
_send_data_timer = ADD_TIMER(_profile, "SendDataTime");
_convert_chunk_timer = ADD_TIMER(_profile, "ConvertChunkTime");
_validate_data_timer = ADD_TIMER(_profile, "ValidateDataTime");
_open_timer = ADD_TIMER(_profile, "OpenTime");
_close_timer = ADD_TIMER(_profile, "CloseWaitTime");
_serialize_chunk_timer = ADD_TIMER(_profile, "SerializeChunkTime");
_wait_response_timer = ADD_TIMER(_profile, "WaitResponseTime");
_compress_timer = ADD_TIMER(_profile, "CompressTime");
_pack_chunk_timer = ADD_TIMER(_profile, "PackChunkTime");
RETURN_IF_ERROR(DataSink::prepare(state));
_sender_id = state->per_fragment_instance_idx();
_num_senders = state->num_per_fragment_instances();
// profile must add to state's object pool
_profile = state->obj_pool()->add(new RuntimeProfile("OlapTableSink"));
SCOPED_TIMER(_profile->total_time_counter());
// Prepare the exprs to run.
@ -643,22 +720,6 @@ Status OlapTableSink::prepare(RuntimeState* state) {
}
}
// add all counter
_input_rows_counter = ADD_COUNTER(_profile, "RowsRead", TUnit::UNIT);
_output_rows_counter = ADD_COUNTER(_profile, "RowsReturned", TUnit::UNIT);
_filtered_rows_counter = ADD_COUNTER(_profile, "RowsFiltered", TUnit::UNIT);
_send_data_timer = ADD_TIMER(_profile, "SendDataTime");
_convert_chunk_timer = ADD_TIMER(_profile, "ConvertChunkTime");
_validate_data_timer = ADD_TIMER(_profile, "ValidateDataTime");
_open_timer = ADD_TIMER(_profile, "OpenTime");
_close_timer = ADD_TIMER(_profile, "CloseWaitTime");
_serialize_chunk_timer = ADD_TIMER(_profile, "SerializeChunkTime");
_wait_response_timer = ADD_TIMER(_profile, "WaitResponseTime");
_compress_timer = ADD_TIMER(_profile, "CompressTime");
_append_attachment_timer = ADD_TIMER(_profile, "AppendAttachmentTime");
_mark_tablet_timer = ADD_TIMER(_profile, "MarkTabletTime");
_pack_chunk_timer = ADD_TIMER(_profile, "PackChunkTime");
_load_mem_limit = state->get_load_mem_limit();
// open all channels
@ -686,13 +747,33 @@ Status OlapTableSink::prepare(RuntimeState* state) {
Status OlapTableSink::open(RuntimeState* state) {
SCOPED_TIMER(_profile->total_time_counter());
SCOPED_TIMER(_open_timer);
RETURN_IF_ERROR(try_open(state));
RETURN_IF_ERROR(open_wait());
return Status::OK();
}
Status OlapTableSink::try_open(RuntimeState* state) {
// Prepare the exprs to run.
RETURN_IF_ERROR(Expr::open(_output_expr_ctxs, state));
for (auto& index_channel : _channels) {
index_channel->for_each_node_channel([](NodeChannel* ch) { ch->open(); });
index_channel->for_each_node_channel([](NodeChannel* ch) { ch->try_open(); });
}
return Status::OK();
}
bool OlapTableSink::is_open_done() {
bool open_done = true;
for (auto& index_channel : _channels) {
index_channel->for_each_node_channel([&open_done](NodeChannel* ch) { open_done &= ch->is_open_done(); });
}
return open_done;
}
Status OlapTableSink::open_wait() {
Status err_st = Status::OK();
for (auto& index_channel : _channels) {
index_channel->for_each_node_channel([&index_channel, &err_st](NodeChannel* ch) {
@ -831,6 +912,15 @@ Status OlapTableSink::send_chunk(RuntimeState* state, vectorized::Chunk* chunk)
return Status::OK();
}
bool OlapTableSink::is_full() {
bool full = false;
for (auto& index_channel : _channels) {
index_channel->for_each_node_channel([&full](NodeChannel* ch) { full |= ch->is_full(); });
}
return full;
}
Status OlapTableSink::_send_chunk_by_node(vectorized::Chunk* chunk, IndexChannel* channel,
std::vector<uint16_t>& selection_idx) {
Status err_st = Status::OK();
@ -862,7 +952,52 @@ Status OlapTableSink::_send_chunk_by_node(vectorized::Chunk* chunk, IndexChannel
return Status::OK();
}
Status OlapTableSink::try_close(RuntimeState* state) {
Status err_st = Status::OK();
bool intolerable_failure = false;
for (auto& index_channel : _channels) {
index_channel->for_each_node_channel([&index_channel, &err_st, &intolerable_failure](NodeChannel* ch) {
auto st = ch->try_close();
if (!st.ok()) {
LOG(WARNING) << "close channel failed. channel_name=" << ch->name()
<< ", load_info=" << ch->print_load_info() << ", error_msg=" << st.get_error_msg();
err_st = st;
index_channel->mark_as_failed(ch);
}
if (index_channel->has_intolerable_failure()) {
intolerable_failure = true;
}
});
}
if (intolerable_failure) {
return err_st;
} else {
return Status::OK();
}
}
bool OlapTableSink::is_close_done() {
bool close_done = true;
for (auto& index_channel : _channels) {
index_channel->for_each_node_channel([&close_done](NodeChannel* ch) { close_done &= ch->is_close_done(); });
}
return close_done;
}
Status OlapTableSink::close(RuntimeState* state, Status close_status) {
if (close_status.ok()) {
do {
close_status = try_close(state);
if (!close_status.ok()) break;
SleepFor(MonoDelta::FromMilliseconds(5));
} while (!is_close_done());
}
return close_wait(state, close_status);
}
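
The synchronous close() above is now a thin polling loop over the async primitives: nudge try_close(), sleep 5 ms, and stop once is_close_done() reports every in-flight request finished. A self-contained sketch of that shape, with a stubbed Channel standing in for the index/node channels:

#include <chrono>
#include <thread>

// Stub channel: try_close() makes progress on pending requests and
// reports intolerable failure; is_close_done() checks completion.
struct Channel {
    bool try_close_ok = true;
    int pending = 3;

    bool try_close() {
        if (pending > 0) --pending;
        return try_close_ok;
    }
    bool is_close_done() const { return pending == 0; }
};

// Blocking close built from the non-blocking pieces, as in
// OlapTableSink::close() above.
bool close_blocking(Channel& ch) {
    do {
        if (!ch.try_close()) return false;
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    } while (!ch.is_close_done());
    return true;
}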
Status OlapTableSink::close_wait(RuntimeState* state, Status close_status) {
Status status = close_status;
if (status.ok()) {
// Only if the status is OK can we call _profile->total_time_counter().
@ -983,7 +1118,6 @@ void _print_decimalv3_error_msg(RuntimeState* state, const CppType& decimal, con
if (state->has_reached_max_error_msg_num()) {
return;
}
std::stringstream ss;
auto decimal_str = DecimalV3Cast::to_string<CppType>(decimal, desc->type().precision, desc->type().scale);
std::string error_msg = strings::Substitute("Decimal '$0' is out of range. The type of '$1' is $2'", decimal_str,
desc->col_name(), desc->type().debug_string());
@ -1004,8 +1138,8 @@ void OlapTableSink::_validate_decimal(RuntimeState* state, vectorized::Column* c
auto* data = &data_column->get_data().front();
int precision = desc->type().precision;
const auto max_decimal = get_scale_factor<CppType>(precision);
const auto min_decimal = -max_decimal;
const auto max_decimal = get_max_decimal<CppType>(precision);
const auto min_decimal = get_min_decimal<CppType>(precision);
for (auto i = 0; i < num_rows; ++i) {
if ((*validate_selection)[i] == VALID_SEL_OK) {

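The decimal-validation fix replaces the scale factor 10^p with the true bounds of a p-digit decimal, ±(10^p - 1): the old maximum was itself out of range for precision p. A sketch of the arithmetic, assuming (as the names suggest) that get_max_decimal/get_min_decimal return those bounds for an integer-backed decimal:

#include <cassert>
#include <cstdint>

// For DECIMAL(p, s) stored as an integer, the largest legal unscaled
// value has p nines. The scale factor 10^p is already out of range,
// which is the bug the diff fixes.
constexpr int64_t scale_factor(int precision) {
    int64_t v = 1;
    for (int i = 0; i < precision; ++i) v *= 10;
    return v;
}

constexpr int64_t max_decimal(int precision) { return scale_factor(precision) - 1; }
constexpr int64_t min_decimal(int precision) { return -max_decimal(precision); }

int main() {
    static_assert(max_decimal(3) == 999, "3-digit max");
    static_assert(min_decimal(3) == -999, "3-digit min");
    assert(scale_factor(3) == 1000);  // 4 digits: must be rejected
}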

@ -81,7 +81,7 @@ template <typename T>
class ReusableClosure : public google::protobuf::Closure {
public:
ReusableClosure() : cid(INVALID_BTHREAD_ID), _refs(0) {}
~ReusableClosure() { join(); }
~ReusableClosure() {}
int count() { return _refs.load(); }
@ -106,6 +106,12 @@ public:
}
}
void cancel() {
if (cid != INVALID_BTHREAD_ID) {
brpc::StartCancel(cid);
}
}
void reset() {
cntl.Reset();
cid = cntl.call_id();
@ -129,13 +135,24 @@ public:
Status init(RuntimeState* state);
// we use open/open_wait to parallel
void open();
// Async open interface: try_open() -> [is_open_done()] -> open_wait()
// If is_open_done() returns true, open_wait() will not block;
// otherwise open_wait() will block.
void try_open();
bool is_open_done();
Status open_wait();
// Async add-chunk interface.
// If is_full() returns false, add_chunk() will not block.
Status add_chunk(vectorized::Chunk* chunk, const int64_t* tablet_ids, const uint32_t* indexes, uint32_t from,
uint32_t size, bool eos);
bool is_full();
// Async close interface: try_close() -> [is_close_done()] -> close_wait()
// If is_close_done() returns true, close_wait() will not block;
// otherwise close_wait() will block.
Status try_close();
bool is_close_done();
Status close_wait(RuntimeState* state);
void cancel(const Status& err_st);
@ -163,6 +180,7 @@ private:
Status _wait_all_prev_request();
Status _wait_one_prev_request();
bool _check_prev_request_done();
bool _check_all_prev_request_done();
Status _serialize_chunk(const vectorized::Chunk* src, ChunkPB* dst);
std::unique_ptr<MemTracker> _mem_tracker = nullptr;
@ -266,11 +284,34 @@ public:
Status prepare(RuntimeState* state) override;
// sync open interface
Status open(RuntimeState* state) override;
// Async open interface: try_open() -> [is_open_done()] -> open_wait()
// If is_open_done() returns true, open_wait() will not block;
// otherwise open_wait() will block.
Status try_open(RuntimeState* state);
bool is_open_done();
Status open_wait();
// Async add-chunk interface.
// If is_full() returns false, add_chunk() will not block.
Status send_chunk(RuntimeState* state, vectorized::Chunk* chunk) override;
// close() will send RPCs too. If RPCs failed, return error.
bool is_full();
// Async close interface: try_close() -> [is_close_done()] -> close_wait()
// If is_close_done() returns true, close_wait() will not block;
// otherwise close_wait() will block.
Status try_close(RuntimeState* state);
bool is_close_done();
Status close_wait(RuntimeState* state, Status close_status);
// sync close() interface
Status close(RuntimeState* state, Status close_status) override;
// Returns the runtime profile for the sink.


@ -312,6 +312,7 @@ struct AggHashMapWithOneNullableNumberKey {
template <typename HashMap>
struct AggHashMapWithOneStringKey {
using KeyType = typename HashMap::key_type;
using Iterator = typename HashMap::iterator;
using ResultVector = typename std::vector<Slice>;
HashMap hash_map;
@ -400,6 +401,7 @@ struct AggHashMapWithOneStringKey {
template <typename HashMap>
struct AggHashMapWithOneNullableStringKey {
using KeyType = typename HashMap::key_type;
using Iterator = typename HashMap::iterator;
using ResultVector = typename std::vector<Slice>;
HashMap hash_map;
@ -527,6 +529,7 @@ struct AggHashMapWithOneNullableStringKey {
template <typename HashMap>
struct AggHashMapWithSerializedKey {
using KeyType = typename HashMap::key_type;
using Iterator = typename HashMap::iterator;
using ResultVector = typename std::vector<Slice>;
HashMap hash_map;
@ -643,6 +646,7 @@ struct AggHashMapWithSerializedKey {
template <typename HashMap>
struct AggHashMapWithSerializedKeyFixedSize {
using KeyType = typename HashMap::key_type;
using Iterator = typename HashMap::iterator;
using FixedSizeSliceKey = typename HashMap::key_type;
using ResultVector = typename std::vector<FixedSizeSliceKey>;


@ -402,14 +402,29 @@ struct AggHashMapVariant {
return 0;
}
size_t memory_usage() const {
size_t reserved_memory_usage(const MemPool* pool) const {
switch (type) {
#define M(NAME) \
case Type::NAME: \
return NAME->hash_map.dump_bound();
return NAME->hash_map.dump_bound() + pool->total_reserved_bytes();
APPLY_FOR_AGG_VARIANT_ALL(M)
#undef M
}
return 0;
}
size_t allocated_memory_usage(const MemPool* pool) const {
switch (type) {
#define M(NAME) \
case Type::NAME: \
return sizeof(decltype(NAME)::element_type::KeyType) * NAME->hash_map.size() + pool->total_allocated_bytes();
APPLY_FOR_AGG_VARIANT_ALL(M)
#undef M
}
return 0;
}
};
@ -676,16 +691,29 @@ struct AggHashSetVariant {
return 0;
}
size_t memory_usage() const {
size_t reserved_memory_usage(const MemPool* pool) const {
switch (type) {
#define M(NAME) \
case Type::NAME: \
return NAME->hash_set.dump_bound();
return NAME->hash_set.dump_bound() + pool->total_reserved_bytes();
APPLY_FOR_AGG_VARIANT_ALL(M)
#undef M
}
return 0;
}
size_t allocated_memory_usage(const MemPool* pool) const {
switch (type) {
#define M(NAME) \
case Type::NAME: \
return sizeof(decltype(NAME)::element_type::KeyType) * NAME->hash_set.size() + pool->total_allocated_bytes();
APPLY_FOR_AGG_VARIANT_ALL(M)
#undef M
}
return 0;
}
};
} // namespace starrocks::vectorized
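
Splitting memory_usage() into reserved_memory_usage() and allocated_memory_usage() separates the upper bound held from the allocator (hash-table capacity plus the pool's reserved chunks) from what the live entries actually consume. A sketch of the two views, with stand-in types for the hash table and MemPool:

#include <cstddef>

// Illustrative stand-ins; these are not the StarRocks types.
struct PoolStats {
    size_t total_reserved_bytes;
    size_t total_allocated_bytes;
};

struct HashTableStats {
    size_t capacity_bytes;  // dump_bound(): grows with capacity
    size_t size;            // number of live entries
    size_t key_bytes;       // sizeof(KeyType)
};

// Upper bound actually held from the allocator.
size_t reserved_memory_usage(const HashTableStats& ht, const PoolStats& pool) {
    return ht.capacity_bytes + pool.total_reserved_bytes;
}

// What the current entries really use.
size_t allocated_memory_usage(const HashTableStats& ht, const PoolStats& pool) {
    return ht.key_bytes * ht.size + pool.total_allocated_bytes;
}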


@ -4,6 +4,7 @@
#include "exec/pipeline/aggregate/aggregate_blocking_sink_operator.h"
#include "exec/pipeline/aggregate/aggregate_blocking_source_operator.h"
#include "exec/pipeline/chunk_accumulate_operator.h"
#include "exec/pipeline/exchange/exchange_source_operator.h"
#include "exec/pipeline/limit_operator.h"
#include "exec/pipeline/operator.h"
@ -52,7 +53,7 @@ Status AggregateBlockingNode::open(RuntimeState* state) {
DCHECK_LE(chunk->num_rows(), runtime_state()->chunk_size());
_aggregator->evaluate_exprs(chunk.get());
RETURN_IF_ERROR(_aggregator->evaluate_exprs(chunk.get()));
size_t chunk_size = chunk->num_rows();
{
@ -68,8 +69,7 @@ Status AggregateBlockingNode::open(RuntimeState* state) {
APPLY_FOR_AGG_VARIANT_ALL(HASH_MAP_METHOD)
#undef HASH_MAP_METHOD
_mem_tracker->set(_aggregator->hash_map_variant().memory_usage() +
_aggregator->mem_pool()->total_reserved_bytes());
_mem_tracker->set(_aggregator->hash_map_variant().reserved_memory_usage(_aggregator->mem_pool()));
TRY_CATCH_BAD_ALLOC(_aggregator->try_convert_to_two_level_map());
}
if (_aggregator->is_none_group_by_exprs()) {
@ -119,7 +119,7 @@ Status AggregateBlockingNode::open(RuntimeState* state) {
COUNTER_SET(_aggregator->input_row_count(), _aggregator->num_input_rows());
_mem_tracker->set(_aggregator->hash_map_variant().memory_usage() + _aggregator->mem_pool()->total_reserved_bytes());
_mem_tracker->set(_aggregator->hash_map_variant().reserved_memory_usage(_aggregator->mem_pool()));
return Status::OK();
}
@ -170,7 +170,8 @@ Status AggregateBlockingNode::get_next(RuntimeState* state, ChunkPtr* chunk, boo
std::vector<std::shared_ptr<pipeline::OperatorFactory> > AggregateBlockingNode::decompose_to_pipeline(
pipeline::PipelineBuilderContext* context) {
using namespace pipeline;
OpFactories operators_with_sink = _children[0]->decompose_to_pipeline(context);
OpFactories ops_with_sink = _children[0]->decompose_to_pipeline(context);
auto& agg_node = _tnode.agg_node;
if (agg_node.need_finalize) {
// If it is a finalizing aggregation with a GROUP BY clause, it can be parallelized
@ -182,7 +183,7 @@ std::vector<std::shared_ptr<pipeline::OperatorFactory> > AggregateBlockingNode::
// 2. Otherwise, add LocalExchangeOperator
// to shuffle multiple streams into #degree_of_parallelism# streams, each of which pipes into AggregateBlockingSinkOperator.
bool need_local_shuffle = true;
if (auto* exchange_op = dynamic_cast<ExchangeSourceOperatorFactory*>(operators_with_sink[0].get());
if (auto* exchange_op = dynamic_cast<ExchangeSourceOperatorFactory*>(ops_with_sink[0].get());
exchange_op != nullptr) {
auto& texchange_node = exchange_op->texchange_node();
DCHECK(texchange_node.__isset.partition_type);
@ -194,18 +195,16 @@ std::vector<std::shared_ptr<pipeline::OperatorFactory> > AggregateBlockingNode::
if (need_local_shuffle) {
std::vector<ExprContext*> group_by_expr_ctxs;
Expr::create_expr_trees(_pool, _tnode.agg_node.grouping_exprs, &group_by_expr_ctxs);
operators_with_sink = context->maybe_interpolate_local_shuffle_exchange(
runtime_state(), operators_with_sink, group_by_expr_ctxs);
ops_with_sink = context->maybe_interpolate_local_shuffle_exchange(runtime_state(), ops_with_sink,
group_by_expr_ctxs);
}
} else {
operators_with_sink =
context->maybe_interpolate_local_passthrough_exchange(runtime_state(), operators_with_sink);
ops_with_sink = context->maybe_interpolate_local_passthrough_exchange(runtime_state(), ops_with_sink);
}
}
// We cannot get the degree of parallelism from PipelineBuilderContext, which is only a suggested value,
// and we may set a different parallelism for the source operator in many special cases.
size_t degree_of_parallelism =
down_cast<SourceOperatorFactory*>(operators_with_sink[0].get())->degree_of_parallelism();
size_t degree_of_parallelism = down_cast<SourceOperatorFactory*>(ops_with_sink[0].get())->degree_of_parallelism();
// shared by sink operator and source operator
AggregatorFactoryPtr aggregator_factory = std::make_shared<AggregatorFactory>(_tnode);
@ -216,23 +215,29 @@ std::vector<std::shared_ptr<pipeline::OperatorFactory> > AggregateBlockingNode::
aggregator_factory);
// Initialize OperatorFactory's fields involving runtime filters.
this->init_runtime_filter_for_operator(sink_operator.get(), context, rc_rf_probe_collector);
- operators_with_sink.push_back(std::move(sink_operator));
- context->add_pipeline(operators_with_sink);
+ ops_with_sink.push_back(std::move(sink_operator));
+ context->add_pipeline(ops_with_sink);
- OpFactories operators_with_source;
+ OpFactories ops_with_source;
auto source_operator = std::make_shared<AggregateBlockingSourceOperatorFactory>(context->next_operator_id(), id(),
aggregator_factory);
// Initialize OperatorFactory's fields involving runtime filters.
this->init_runtime_filter_for_operator(source_operator.get(), context, rc_rf_probe_collector);
// Aggregator must be used by a pair of sink and source operators,
- // so operators_with_source's degree of parallelism must be equal with operators_with_sink's
+ // so ops_with_source's degree of parallelism must be equal with ops_with_sink's
source_operator->set_degree_of_parallelism(degree_of_parallelism);
- operators_with_source.push_back(std::move(source_operator));
+ ops_with_source.push_back(std::move(source_operator));
+ if (!_tnode.conjuncts.empty() || ops_with_source.back()->has_runtime_filters()) {
+ ops_with_source.emplace_back(
+ std::make_shared<ChunkAccumulateOperatorFactory>(context->next_operator_id(), id()));
+ }
if (limit() != -1) {
- operators_with_source.emplace_back(
+ ops_with_source.emplace_back(
std::make_shared<LimitOperatorFactory>(context->next_operator_id(), id(), limit()));
}
- return operators_with_source;
+ return ops_with_source;
}
} // namespace starrocks::vectorized
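Note: the recurring change in this file swaps the hand-written sum `hash_map_variant().memory_usage() + mem_pool()->total_reserved_bytes()` for a single `reserved_memory_usage(mem_pool())` call. The helper's body is not part of this diff, so the sketch below is only a plausible reading, with stand-in types: folding the same two terms into one place so the call sites cannot drift apart as they are edited.

```cpp
#include <cstddef>

// Stand-ins with only the members this sketch touches; everything here is an
// assumption about the real API, which this diff does not show.
struct MemPool {
    size_t reserved = 0;
    size_t total_reserved_bytes() const { return reserved; }
};

struct HashMapVariantSketch {
    size_t table_bytes = 0;
    size_t memory_usage() const { return table_bytes; }

    // Plausible body: the hash table's own footprint plus every byte reserved
    // in the pool that backs serialized keys and aggregate states.
    size_t reserved_memory_usage(const MemPool* pool) const {
        return memory_usage() + pool->total_reserved_bytes();
    }
};
```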


@@ -58,7 +58,7 @@ Status AggregateStreamingNode::get_next(RuntimeState* state, ChunkPtr* chunk, bo
size_t input_chunk_size = input_chunk->num_rows();
_aggregator->update_num_input_rows(input_chunk_size);
COUNTER_SET(_aggregator->input_row_count(), _aggregator->num_input_rows());
- _aggregator->evaluate_exprs(input_chunk.get());
+ RETURN_IF_ERROR(_aggregator->evaluate_exprs(input_chunk.get()));
if (_aggregator->streaming_preaggregation_mode() == TStreamingPreaggregationMode::FORCE_STREAMING) {
// force execute streaming
@@ -88,8 +88,7 @@ Status AggregateStreamingNode::get_next(RuntimeState* state, ChunkPtr* chunk, bo
_aggregator->compute_batch_agg_states(input_chunk_size);
}
- _mem_tracker->set(_aggregator->hash_map_variant().memory_usage() +
- _aggregator->mem_pool()->total_reserved_bytes());
+ _mem_tracker->set(_aggregator->hash_map_variant().reserved_memory_usage(_aggregator->mem_pool()));
TRY_CATCH_BAD_ALLOC(_aggregator->try_convert_to_two_level_map());
COUNTER_SET(_aggregator->hash_table_size(), (int64_t)_aggregator->hash_map_variant().size());
@@ -136,8 +135,7 @@ Status AggregateStreamingNode::get_next(RuntimeState* state, ChunkPtr* chunk, bo
_aggregator->compute_batch_agg_states(input_chunk_size);
}
- _mem_tracker->set(_aggregator->hash_map_variant().memory_usage() +
- _aggregator->mem_pool()->total_reserved_bytes());
+ _mem_tracker->set(_aggregator->hash_map_variant().reserved_memory_usage(_aggregator->mem_pool()));
TRY_CATCH_BAD_ALLOC(_aggregator->try_convert_to_two_level_map());
COUNTER_SET(_aggregator->hash_table_size(), (int64_t)_aggregator->hash_map_variant().size());
continue;
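Note: the other edit repeated across these files wraps `_aggregator->evaluate_exprs(...)` in `RETURN_IF_ERROR`, so an expression-evaluation failure now aborts the call chain instead of being silently ignored. The StarRocks definition of the macro is not shown in this diff; such status-propagation macros commonly look like the hedged sketch below.

```cpp
#include <string>

// Minimal stand-in; the real starrocks::Status carries codes and messages.
struct Status {
    std::string msg;
    bool ok() const { return msg.empty(); }
    static Status OK() { return {}; }
};

// The usual shape of such a macro: evaluate the statement once and return
// early from the enclosing function if it failed.
#define RETURN_IF_ERROR(stmt)      \
    do {                           \
        Status _st = (stmt);       \
        if (!_st.ok()) return _st; \
    } while (false)

// Hypothetical failing call, just to exercise the macro.
Status evaluate_exprs() { return Status{"expr evaluation failed"}; }

Status get_next_sketch() {
    RETURN_IF_ERROR(evaluate_exprs());  // before this diff, the Status was dropped here
    return Status::OK();
}
```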


@@ -4,6 +4,8 @@
#include "exec/pipeline/aggregate/aggregate_distinct_blocking_sink_operator.h"
#include "exec/pipeline/aggregate/aggregate_distinct_blocking_source_operator.h"
#include "exec/pipeline/chunk_accumulate_operator.h"
#include "exec/pipeline/exchange/exchange_source_operator.h"
#include "exec/pipeline/limit_operator.h"
#include "exec/pipeline/operator.h"
#include "exec/pipeline/pipeline_builder.h"
@@ -45,7 +47,7 @@ Status DistinctBlockingNode::open(RuntimeState* state) {
}
DCHECK_LE(chunk->num_rows(), runtime_state()->chunk_size());
- _aggregator->evaluate_exprs(chunk.get());
+ RETURN_IF_ERROR(_aggregator->evaluate_exprs(chunk.get()));
{
SCOPED_TIMER(_aggregator->agg_compute_timer());
@@ -59,8 +61,7 @@ Status DistinctBlockingNode::open(RuntimeState* state) {
APPLY_FOR_AGG_VARIANT_ALL(HASH_SET_METHOD)
#undef HASH_SET_METHOD
- _mem_tracker->set(_aggregator->hash_set_variant().memory_usage() +
- _aggregator->mem_pool()->total_reserved_bytes());
+ _mem_tracker->set(_aggregator->hash_set_variant().reserved_memory_usage(_aggregator->mem_pool()));
TRY_CATCH_BAD_ALLOC(_aggregator->try_convert_to_two_level_set());
_aggregator->update_num_input_rows(chunk->num_rows());
@@ -90,7 +91,7 @@ Status DistinctBlockingNode::open(RuntimeState* state) {
COUNTER_SET(_aggregator->input_row_count(), _aggregator->num_input_rows());
- _mem_tracker->set(_aggregator->hash_set_variant().memory_usage() + _aggregator->mem_pool()->total_reserved_bytes());
+ _mem_tracker->set(_aggregator->hash_set_variant().reserved_memory_usage(_aggregator->mem_pool()));
return Status::OK();
}
@@ -133,7 +134,8 @@ Status DistinctBlockingNode::get_next(RuntimeState* state, ChunkPtr* chunk, bool
std::vector<std::shared_ptr<pipeline::OperatorFactory> > DistinctBlockingNode::decompose_to_pipeline(
pipeline::PipelineBuilderContext* context) {
using namespace pipeline;
- OpFactories operators_with_sink = _children[0]->decompose_to_pipeline(context);
+ OpFactories ops_with_sink = _children[0]->decompose_to_pipeline(context);
// Create a shared RefCountedRuntimeFilterCollector
auto&& rc_rf_probe_collector = std::make_shared<RcRfProbeCollector>(2, std::move(this->runtime_filter_collector()));
@@ -147,26 +149,43 @@ std::vector<std::shared_ptr<pipeline::OperatorFactory> > DistinctBlockingNode::d
// Initialize OperatorFactory's fields involving runtime filters.
this->init_runtime_filter_for_operator(sink_operator.get(), context, rc_rf_probe_collector);
- OpFactories operators_with_source;
+ OpFactories ops_with_source;
auto source_operator = std::make_shared<AggregateDistinctBlockingSourceOperatorFactory>(context->next_operator_id(),
id(), aggregator_factory);
// Initialize OperatorFactory's fields involving runtime filters.
this->init_runtime_filter_for_operator(source_operator.get(), context, rc_rf_probe_collector);
- operators_with_sink = context->maybe_interpolate_local_shuffle_exchange(runtime_state(), operators_with_sink,
- partition_expr_ctxs);
- operators_with_sink.push_back(std::move(sink_operator));
- context->add_pipeline(operators_with_sink);
+ bool need_local_shuffle = true;
+ if (auto* exchange_op = dynamic_cast<ExchangeSourceOperatorFactory*>(ops_with_sink[0].get());
+ exchange_op != nullptr) {
+ auto& texchange_node = exchange_op->texchange_node();
+ DCHECK(texchange_node.__isset.partition_type);
+ need_local_shuffle = texchange_node.partition_type != TPartitionType::HASH_PARTITIONED &&
+ texchange_node.partition_type != TPartitionType::BUCKET_SHUFFLE_HASH_PARTITIONED;
+ }
+ if (need_local_shuffle) {
+ ops_with_sink =
+ context->maybe_interpolate_local_shuffle_exchange(runtime_state(), ops_with_sink, partition_expr_ctxs);
+ }
+ ops_with_sink.push_back(std::move(sink_operator));
+ context->add_pipeline(ops_with_sink);
// Aggregator must be used by a pair of sink and source operators,
- // so operators_with_source's degree of parallelism must be equal with operators_with_sink's
- auto degree_of_parallelism = ((SourceOperatorFactory*)(operators_with_sink[0].get()))->degree_of_parallelism();
+ // so operators_with_source's degree of parallelism must be equal with ops_with_sink's
+ auto degree_of_parallelism = ((SourceOperatorFactory*)(ops_with_sink[0].get()))->degree_of_parallelism();
source_operator->set_degree_of_parallelism(degree_of_parallelism);
- operators_with_source.push_back(std::move(source_operator));
+ ops_with_source.push_back(std::move(source_operator));
+ if (!_tnode.conjuncts.empty() || ops_with_source.back()->has_runtime_filters()) {
+ ops_with_source.emplace_back(
+ std::make_shared<ChunkAccumulateOperatorFactory>(context->next_operator_id(), id()));
+ }
if (limit() != -1) {
- operators_with_source.emplace_back(
+ ops_with_source.emplace_back(
std::make_shared<LimitOperatorFactory>(context->next_operator_id(), id(), limit()));
}
- return operators_with_source;
+ return ops_with_source;
}
} // namespace starrocks::vectorized
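Note: this hunk ports two behaviors into DistinctBlockingNode that AggregateBlockingNode already gained above. First, the local shuffle is skipped when the upstream exchange is already hash-partitioned: rows of one group then arrive at the same driver, so re-partitioning locally would be a no-op. Second, a ChunkAccumulateOperatorFactory is appended when conjuncts or runtime filters are present. The accumulate operator's internals are not in this diff; the sketch below, with assumed names and stand-in types, illustrates the idea: filters can leave chunks far below the configured size, and merging them restores batch efficiency downstream.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Stand-in chunk; the real vectorized::Chunk is columnar and far richer.
struct Chunk {
    std::vector<int64_t> rows;
    size_t num_rows() const { return rows.size(); }
    void append(const Chunk& other) {
        rows.insert(rows.end(), other.rows.begin(), other.rows.end());
    }
};
using ChunkPtr = std::shared_ptr<Chunk>;

// Hypothetical accumulator: buffer undersized chunks, emit one once enough
// rows have piled up, and flush the remainder at end of stream.
class ChunkAccumulatorSketch {
public:
    explicit ChunkAccumulatorSketch(size_t desired_rows) : _desired_rows(desired_rows) {}

    // Returns a merged chunk when it is reasonably full, else nullptr
    // to signal "keep feeding".
    ChunkPtr push(ChunkPtr in) {
        if (_pending == nullptr) {
            _pending = std::move(in);
        } else {
            _pending->append(*in);
        }
        return _pending->num_rows() >= _desired_rows ? std::exchange(_pending, nullptr) : nullptr;
    }

    ChunkPtr finalize() { return std::exchange(_pending, nullptr); }

private:
    size_t _desired_rows;
    ChunkPtr _pending;
};
```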


@@ -53,7 +53,7 @@ Status DistinctStreamingNode::get_next(RuntimeState* state, ChunkPtr* chunk, boo
size_t input_chunk_size = input_chunk->num_rows();
_aggregator->update_num_input_rows(input_chunk_size);
COUNTER_SET(_aggregator->input_row_count(), _aggregator->num_input_rows());
- _aggregator->evaluate_exprs(input_chunk.get());
+ RETURN_IF_ERROR(_aggregator->evaluate_exprs(input_chunk.get()));
if (_aggregator->streaming_preaggregation_mode() == TStreamingPreaggregationMode::FORCE_STREAMING) {
// force execute streaming
@@ -80,8 +80,7 @@ Status DistinctStreamingNode::get_next(RuntimeState* state, ChunkPtr* chunk, boo
COUNTER_SET(_aggregator->hash_table_size(), (int64_t)_aggregator->hash_set_variant().size());
- _mem_tracker->set(_aggregator->hash_set_variant().memory_usage() +
- _aggregator->mem_pool()->total_reserved_bytes());
+ _mem_tracker->set(_aggregator->hash_set_variant().reserved_memory_usage(_aggregator->mem_pool()));
TRY_CATCH_BAD_ALLOC(_aggregator->try_convert_to_two_level_set());
continue;
@@ -114,8 +113,7 @@ Status DistinctStreamingNode::get_next(RuntimeState* state, ChunkPtr* chunk, boo
COUNTER_SET(_aggregator->hash_table_size(), (int64_t)_aggregator->hash_set_variant().size());
- _mem_tracker->set(_aggregator->hash_set_variant().memory_usage() +
- _aggregator->mem_pool()->total_reserved_bytes());
+ _mem_tracker->set(_aggregator->hash_set_variant().reserved_memory_usage(_aggregator->mem_pool()));
TRY_CATCH_BAD_ALLOC(_aggregator->try_convert_to_two_level_set());
continue;
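Note: each memory-tracker hunk in these streaming nodes sits next to a `TRY_CATCH_BAD_ALLOC(_aggregator->try_convert_to_two_level_map/set())` call. The conversion itself is not shown in this diff; "two-level" hash tables conventionally shard entries by the high bits of the hash so each sub-table stays small, which bounds rehash cost once the aggregation state grows large, and the conversion can allocate heavily, hence the bad-alloc guard. A hedged illustration of that layout follows; it assumes a well-mixed 64-bit hash, and the real StarRocks variant types differ.

```cpp
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Illustrative two-level map: route each key to one of 256 shards by its top
// hash bits. Growth and rehashing then happen per shard instead of globally.
template <typename K, typename V>
class TwoLevelMapSketch {
    static constexpr size_t kShards = 256;
    std::vector<std::unordered_map<K, V>> _shards;

    static size_t shard_of(size_t hash) {
        return (hash >> (sizeof(size_t) * 8 - 8)) % kShards;  // top 8 bits
    }

public:
    TwoLevelMapSketch() : _shards(kShards) {}

    V& operator[](const K& key) {
        return _shards[shard_of(std::hash<K>{}(key))][key];
    }

    size_t size() const {
        size_t n = 0;
        for (const auto& s : _shards) n += s.size();
        return n;
    }
};
```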


@@ -274,6 +274,10 @@ void Aggregator::close(RuntimeState* state) {
}
_is_closed = true;
+ // Clear the buffer
+ while (!_buffer.empty()) {
+ _buffer.pop();
+ }
auto agg_close = [this, state]() {
// _mem_pool being nullptr means the prepare phase failed
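Note: the four added lines drain `_buffer` before the close lambda runs; Analytor::close below gets the same fix and additionally clears `_input_chunks`. Assuming `_buffer` queues shared-ptr chunks waiting for get_next(), a close() triggered early (cancellation, or a LIMIT satisfied upstream) would otherwise keep those chunks alive for as long as the aggregator object itself; each pop drops what is typically the last reference.

```cpp
#include <memory>
#include <queue>
#include <vector>

using ChunkPtr = std::shared_ptr<std::vector<int64_t>>;  // assumed buffered type

// Drop every queued chunk; each pop releases a reference, so the chunk memory
// is freed now rather than whenever the owning object is finally destroyed.
void drain(std::queue<ChunkPtr>& buffer) {
    while (!buffer.empty()) {
        buffer.pop();
    }
}
```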


@@ -70,7 +70,7 @@ public:
void close(RuntimeState* state) override;
- std::unique_ptr<MemPool>& mem_pool() { return _mem_pool; };
+ const MemPool* mem_pool() const { return _mem_pool.get(); }
bool is_none_group_by_exprs() { return _group_by_expr_ctxs.empty(); }
const std::vector<ExprContext*>& conjunct_ctxs() { return _conjunct_ctxs; }
const std::vector<ExprContext*>& group_by_expr_ctxs() { return _group_by_expr_ctxs; }
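Note: the header change tightens the pool accessor from a mutable `std::unique_ptr<MemPool>&` to a `const MemPool*`, which is all the new `reserved_memory_usage(_aggregator->mem_pool())` call sites need. The pattern, sketched with a stand-in type:

```cpp
#include <memory>

struct MemPool {};  // stand-in for the real allocator

class AggregatorSketch {
    std::unique_ptr<MemPool> _mem_pool = std::make_unique<MemPool>();

public:
    // Old accessor: exposed the owning pointer, letting any caller reset()
    // or move the pool out from under the aggregator.
    // std::unique_ptr<MemPool>& mem_pool() { return _mem_pool; }

    // New accessor: a read-only observer pointer.
    const MemPool* mem_pool() const { return _mem_pool.get(); }
};
```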


@@ -307,6 +307,10 @@ void Analytor::close(RuntimeState* state) {
return;
}
+ while (!_buffer.empty()) {
+ _buffer.pop();
+ }
+ _input_chunks.clear();
_is_closed = true;
auto agg_close = [this, state]() {

Some files were not shown because too many files have changed in this diff.