[globalindex] Support multi-column GlobalIndex framework by CrownChu · Pull Request #7933 · apache/paimon

CrownChu · 2026-05-22T03:25:04Z

Extend the GlobalIndex SPI, build path, and query path to support one index builder handling multiple columns (e.g. Lucene indexing title + content + tags together). Key changes:

GlobalIndexerFactory/GlobalIndexer: add List create overloads
GlobalIndexMultiColumnWriter: new interface for multi-column writes
GlobalIndexBuilderUtils: toIndexFileMetas/createIndexWriter accept List
GlobalIndexScanner: route extraFieldIds to same reader group
VectorScanImpl/FullTextScanImpl: match against extraFieldIds
GenericIndexTopoBuilder (Flink): multi-column projection and writer dispatch
DefaultGlobalIndexBuilder/TopoBuilder (Spark): multi-column support
All single-column APIs preserved for backward compatibility

Purpose

Some index engines (e.g. Lucene) can build a single index over multiple columns — full-text on title and vector on embedding in the same index file. Previously the GlobalIndex SPI only supported one column
per indexer: GlobalIndexerFactory.create(DataField, Options) and GlobalIndexSingletonWriter.write(Object). This meant multi-column engines had to create separate index files per column, losing co-located
search benefits and doubling I/O.

This PR adds a multi-column path through the entire stack:

SPI layer (paimon-common): GlobalIndexerFactory.create(List, Options) default method (falls back to single-column for existing implementations). New GlobalIndexMultiColumnWriter interface
accepts InternalRow with all indexed columns projected in field order.
Core index metadata (paimon-core): GlobalIndexBuilderUtils.toIndexFileMetas populates GlobalIndexMeta.extraFieldIds from the field list beyond the first. GlobalIndexScanner.createReaders resolves the full
field list from metadata and passes it to the factory. Extra field IDs are registered in the indexMetas map so queries against any column in the group find the same reader.
Query path (paimon-core): VectorScanImpl and FullTextScanImpl check extraFieldIds when matching index files to query columns, so a vector query on embedding finds an index whose primary field is title but
includes embedding as an extra field.
Flink build (paimon-flink): GenericIndexTopoBuilder accepts List indexColumns, projects all columns + _ROW_ID, and dispatches to GlobalIndexMultiColumnWriter.write(InternalRow) when multi-column.
findMinNonIndexableRowId checks containsAll(indexColumns) for schema evolution safety.
Spark build (paimon-spark): DefaultGlobalIndexBuilder and DefaultGlobalIndexTopoBuilder gain List overloads, with single-column constructors delegating to the multi-column path.

All existing single-column callers are unchanged — new APIs have default implementations that delegate to the original single-column methods.

Tests

GenericIndexTopoBuilderTest: Updated findMinNonIndexableRowId call to use List signature.
paimon-lucene module (on feature-globalindex-support-multi-test branch): A full Lucene 9.12.0 implementation exercising the multi-column framework end-to-end:
- LuceneGlobalIndexTest — single-column vector write/search, top-K score ordering (2 tests)
- LuceneFullTextIndexTest — single-column full-text write/search, score verification, no-results case (3 tests)
- LuceneMultiColumnIndexTest — multi-column (text + vector) via GlobalIndexerFactory.create(List, Options) → GlobalIndexMultiColumnWriter.write(InternalRow), verifies both full-text and vector
  queries on same index file; also tests SPI discovery path via GlobalIndexer.create("lucene", ...) (2 tests)
- LuceneGlobalIndexScanTest — end-to-end Paimon table tests: creates FileStoreTable, writes data, builds Lucene index, commits via DataIncrement.indexIncrement, queries via
  VectorSearchBuilder/FullTextSearchBuilder → ReadBuilder.newScan().withGlobalIndexResult(), reads back rows and asserts correctness (3 tests)

JingsongLi

Review: [globalindex] Support multi-column GlobalIndex framework

Overall this is a well-structured change that cleanly extends the single-column SPI to support multi-column indexes. The backward compatibility is handled well via default methods. I have a few concerns about correctness and robustness:

1. [Bug] Flink `BuildIndexOperator`: Multi-column writer receives row with extra `_ROW_ID` field

File: paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/globalindex/GenericIndexTopoBuilder.java

The projected row type is built as:

List<String> readColumns = new ArrayList<>(indexColumns);
readColumns.add(SpecialFields.ROW_ID.name());
RowType projectedRowType = SpecialFields.rowTypeWithRowId(rowType).project(readColumns);

So the InternalRow read from data has the layout [indexCol1, indexCol2, ..., _ROW_ID]. In the multi-column branch:

if (multiColumn) {
    ((GlobalIndexMultiColumnWriter) indexWriter).write(row);
}

The entire row (including the trailing _ROW_ID field) is passed to the writer. But the writer was created with only the index fields via createIndexWriter(table, indexType, indexFields, mergedOptions), and the Javadoc on GlobalIndexMultiColumnWriter.write() states: "The row layout matches the fields order passed to GlobalIndexerFactory#create(List, Options)".

This is a contract violation. Writer implementations that iterate over all row fields or check row.getFieldCount() will see an unexpected extra column. You should either:

Create a sub-projection of the row that excludes _ROW_ID before passing it to the writer, or
Clearly document that the row may contain trailing fields beyond the indexed columns (and update the Javadoc accordingly).

2. [Correctness] No null-field handling in multi-column mode (Flink)

In single-column mode, if the indexed field is null, the shard stops early:

Object fieldData = indexFieldGetters[0].getFieldOrNull(row);
if (fieldData == null) {
    LOG.info("Null value at rowId={}, stopping shard [{}, {}].", ...);
    break;
}

In multi-column mode, the row is passed directly without any null check on individual fields. If one of the indexed columns is null in a multi-column row, the behavior depends entirely on the writer implementation. This asymmetry could lead to:

Writer failures (NPE inside Lucene, for example)
Silent corruption of the index

Consider at minimum documenting this contract difference, or adding validation that checks indexed columns for null before passing to the multi-column writer.

3. [Robustness] Unsafe cast without `instanceof` check

Files: GenericIndexTopoBuilder.java (Flink), DefaultGlobalIndexBuilder.java (Spark)

Both paths cast the writer based solely on the multiColumn flag:

((GlobalIndexMultiColumnWriter) indexWriter).write(row);

However, GlobalIndexerFactory.create(List<DataField>, Options) has a default implementation that falls back to single-column:

default GlobalIndexer create(List<DataField> fields, Options options) {
    return create(fields.get(0), options);
}

If an existing factory has not been updated to support multi-column (and relies on this default), it will return a GlobalIndexSingletonWriter, and the cast will fail with a ClassCastException at runtime.

Suggestion: Add a validation check after writer creation:

if (multiColumn && !(indexWriter instanceof GlobalIndexMultiColumnWriter)) {
    throw new UnsupportedOperationException(
        "Index type '" + indexType + "' does not support multi-column indexing. " +
        "The factory must override create(List<DataField>, Options) and return a GlobalIndexMultiColumnWriter.");
}

4. [Design] Interface default method creates silent data-loss path

File: paimon-spark/paimon-spark-common/src/main/java/org/apache/paimon/spark/globalindex/GlobalIndexTopologyBuilder.java

The new default method on the interface:

default List<CommitMessage> buildIndex(..., List<DataField> indexFields, ...) {
    return buildIndex(..., indexFields.get(0), options);
}

This silently discards all fields beyond the first for any implementation that has not been updated to override the multi-field method. Combined with DefaultGlobalIndexTopoBuilder overriding the single-field method to delegate to multi-field, this creates an unusual delegation pattern that is correct for DefaultGlobalIndexTopoBuilder but could be a trap for other implementations.

Consider adding a log warning in the default method when indexFields.size() > 1, or throwing UnsupportedOperationException instead of silently dropping fields.

5. [Minor] `resolveFields` assumes metadata consistency across range groups

File: paimon-core/src/main/java/org/apache/paimon/globalindex/GlobalIndexScanner.java

private List<DataField> resolveFields(Map<Range, List<IndexFileMeta>> metas, RowType rowType) {
    GlobalIndexMeta firstMeta =
            checkNotNull(metas.values().iterator().next().get(0).globalIndexMeta());
    ...
}

This takes the first index file's metadata from an arbitrary range to determine the field list. If different ranges have inconsistent metadata (e.g., during a transition where an old single-column index coexists with a new multi-column index for the same type), this could resolve incorrect fields. A defensive check that all ranges share the same field list would prevent hard-to-diagnose query errors.

Summary

The most critical issue is #1 (the _ROW_ID field leaking into the multi-column writer). Issues #2 and #3 are important for production robustness. Issues #4 and #5 are lower priority but worth addressing for long-term maintainability.

Extend the GlobalIndex SPI, build path, and query path to support one index builder handling multiple columns (e.g. Lucene indexing title + content + tags together). Key changes: - GlobalIndexerFactory/GlobalIndexer: add List<DataField> create overloads - GlobalIndexMultiColumnWriter: new interface for multi-column writes - GlobalIndexBuilderUtils: toIndexFileMetas/createIndexWriter accept List<DataField> - GlobalIndexScanner: route extraFieldIds to same reader group - VectorScanImpl/FullTextScanImpl: match against extraFieldIds - GenericIndexTopoBuilder (Flink): multi-column projection and writer dispatch - DefaultGlobalIndexBuilder/TopoBuilder (Spark): multi-column support - All single-column APIs preserved for backward compatibility

Allow index_column parameter to accept comma-separated column names (e.g. "title,embedding") for both Flink and Spark procedures. Add List<String> overload for GenericIndexTopoBuilder.buildIndexAndExecute.

CrownChu · 2026-05-25T18:05:22Z

Fixes and Additions in This Round

1. [Bug] Flink BuildIndexOperator: Multi-column writer receives row with extra _ROW_ID field

Fix: The read projection includes indexColumns + _ROW_ID (_ROW_ID is needed for shard boundary positioning). Before passing to the writer, a secondary projection via
ProjectedRow.from(indexOnlyMapping) strips the trailing _ROW_ID, ensuring GlobalIndexMultiColumnWriter.write() receives a row containing only the index fields.

2. [Correctness] No null-field handling in multi-column mode (Flink)

Fix: In multi-column mode, each index field is checked individually. If any field is null, the current shard stops writing immediately (break). Only rows where all indexed columns are non-null are
written to the index.

3. [Robustness] Unsafe cast without instanceof check

Fix: Added instanceof GlobalIndexMultiColumnWriter check before casting. If the check fails, an UnsupportedOperationException is thrown with a clear message indicating the factory must override
create(List<DataField>, Options) and return a GlobalIndexMultiColumnWriter.

4. [Design] Interface default method creates silent data-loss path

Fix: Runtime instanceof check + exception ensures that when a factory does not properly implement multi-column support, it fails fast rather than silently falling through to SingletonWriter and losing
data.

5. [Minor] resolveFields assumes metadata consistency across range groups

Fix: For multi-column indexes (indexFieldId == -1), added validation that iterates all entries across range groups and verifies indexFieldId and extraFieldIds are consistent. Throws an exception on
mismatch. Single-column case retains the original logic unchanged.

6. Multi-column minimum rowId for schema evolution

findMinNonIndexableRowId accepts List<String> indexColumns and uses containsAll(indexColumns) to check. If any index column is missing from a file's schema, that file is considered non-indexable and
its firstRowId is used as the filter boundary.

7. Multi-column GlobalIndexMeta storage convention

In multi-column mode, indexFieldId = -1 (MULTI_COLUMN_INDEX_FIELD_ID constant) and all actual field IDs are stored in extraFieldIds. Single-column remains unchanged (indexFieldId = actual field ID,
extraFieldIds = null). Scanner skips registering under -1 and routes predicates to the correct reader solely through extraFieldIds.

8. New ES Topology Builder (es-multi-index)

Added ESIndexTopoBuilder (Flink) — standalone topology with shard-level parallelism, schema evolution check, and transparent indexType pass-through
Added ESGlobalIndexTopoBuilder (Spark) — registered via SPI, identifier() returns "es-multi-index"
Flink CreateGlobalIndexProcedure routing: startsWith("es-multi-index") → ESIndexTopoBuilder
Spark GlobalIndexTopologyBuilderUtils: exact match first, then prefix match

9. Multi-condition reader predicate push-down

GlobalIndexEvaluator refactored to async parallel execution (CompletableFuture), no longer implements PredicateVisitor
Added flattenChildren() to flatten nested CompoundPredicates of the same type
UnionGlobalIndexReader removed internal ExecutorService; parallelism managed by the upper-level evaluator
GlobalIndexScanner registers extraFieldIds into the index map, ensuring multi-column predicates hit the correct reader

jerry-024 · 2026-05-26T07:05:05Z

-                                        task.shardRange.to);
-                                break;
+                            if (multiColumn) {
+                                ((GlobalIndexMultiColumnWriter) indexWriter).write(row);


Bug: missing null check in multi-column path.

The single-column path below checks for null and breaks the shard, and ESIndexTopoBuilder.BuildESIndexOperator.processElement() in this same PR also checks all columns for null before writing. But this multi-column path writes row directly with no null check.

Known implementations like LuminaVectorGlobalIndexWriter.write() throw IllegalArgumentException on null input — so a multi-column index containing a vector column will crash the Flink job if any row has a null value in the indexed columns.

Suggested fix — add the same null-field guard that ESIndexTopoBuilder has:

if (multiColumn) { boolean hasNull = false; for (InternalRow.FieldGetter getter : indexFieldGetters) { if (getter.getFieldOrNull(row) == null) { hasNull = true; break; } } if (hasNull) { LOG.info( "Null value in indexed columns at rowId={}, stopping shard [{}, {}].", currentRowId, task.shardRange.from, task.shardRange.to); break; } ((GlobalIndexMultiColumnWriter) indexWriter).write(row); }

jerry-024

The multi-column GlobalIndex SPI framework changes look good. However, the ES-specific code (ESIndexTopoBuilder, ESGlobalIndexTopoBuilder, ES routing in procedures, prefix matching in utils, and SPI registration) is a separate feature that happens to use the multi-column framework. Suggest splitting it into its own PR to keep this one focused.

Also, findMinNonIndexableRowId and filterEntriesBefore are copy-pasted across GenericIndexTopoBuilder, ESIndexTopoBuilder, and ESGlobalIndexTopoBuilder — consider extracting them into a shared utility.

jerry-024 · 2026-05-26T10:03:08Z

+ * with multi-column support.
+ */
+public class ESIndexTopoBuilder {
+


This entire file is an ES-specific topology builder, unrelated to the multi-column GlobalIndex SPI framework. Consider moving it to a separate PR to keep this one focused on the framework changes.

Also, findMinNonIndexableRowId, filterEntriesBefore, computeShardTasks, and closeWriterQuietly are copy-pasted from GenericIndexTopoBuilder — consider extracting them into a shared utility (e.g. GlobalIndexBuilderUtils).

jerry-024 · 2026-05-26T10:03:08Z

+ * parallelism. Supports both single-column and multi-column indexing.
+ */
+public class ESGlobalIndexTopoBuilder implements GlobalIndexTopologyBuilder {
+


Same as ESIndexTopoBuilder on the Flink side — this ES-specific Spark topology builder is not part of the multi-column framework and should go in a separate PR.

findMinNonIndexableRowId and filterEntriesBefore are again duplicated here (third copy). Please extract into a shared utility.

jerry-024 · 2026-05-26T10:03:08Z

                return new String[] {
                    "BTree global index created successfully for table: " + table.name()
                };
+            } else if (indexType.startsWith(ESIndexTopoBuilder.ES_INDEX_TYPE_PREFIX)) {


This ES routing branch (ESIndexTopoBuilder.ES_INDEX_TYPE_PREFIX) is ES-specific and not related to multi-column support. Consider moving it to the ES-specific PR.

jerry-024 · 2026-05-26T10:03:08Z

+        if (builder != null) {
+            return builder;
+        }
+        // Prefix match: e.g. "es-multi-index-diskbbq" matches registered "es-multi-index"


The prefix matching logic here is ES-specific — the comment even says "es-multi-index-diskbbq" matches registered "es-multi-index". Consider moving this to the ES-specific PR.

jerry-024 · 2026-05-26T10:03:08Z

 # limitations under the License.

 org.apache.paimon.spark.globalindex.btree.BTreeIndexTopoBuilder
+org.apache.paimon.spark.globalindex.ESGlobalIndexTopoBuilder


Registration of ESGlobalIndexTopoBuilder should go with the ES-specific PR.

…ds validation

…e into GlobalIndexBuilderUtils

jerry-024

Thanks for splitting out the ES code and extracting the shared utilities — much cleaner now. Found a few issues in the latest version:

jerry-024 · 2026-05-26T11:46:05Z

@@ -475,7 +475,7 @@ void testAppendFilterOldFilesBeforeNewFiles() {
                GenericIndexTopoBuilder.filterEntriesBefore(
                        entries,
                        GenericIndexTopoBuilder.findMinNonIndexableRowId(


Compile error: findMinNonIndexableRowId and filterEntriesBefore were moved from GenericIndexTopoBuilder to GlobalIndexBuilderUtils in commit 0cfc7ef, but this test still references them via GenericIndexTopoBuilder.findMinNonIndexableRowId(...) and GenericIndexTopoBuilder.filterEntriesBefore(...). This will fail to compile.

Should be:

GlobalIndexBuilderUtils.filterEntriesBefore( entries, GlobalIndexBuilderUtils.findMinNonIndexableRowId( schemaManager, entries, Collections.singletonList("vec")));

jerry-024 · 2026-05-26T11:46:05Z

+                            } else {
+                                Object fieldData = indexFieldGetters[0].getFieldOrNull(row);
+                                if (fieldData == null) {
+                                    LOG.info(


Multi-column writer receives extra _ROW_ID column: In the multi-column path, row is passed directly to GlobalIndexMultiColumnWriter.write(row), but this row comes from projectedRowType which is indexColumns + _ROW_ID. The writer's contract (javadoc on GlobalIndexMultiColumnWriter.write) says the row layout should match the fields passed to GlobalIndexerFactory.create(List<DataField>, Options) — which doesn't include _ROW_ID.

The previous ES-specific code handled this with a ProjectedRow that stripped _ROW_ID before writing. Consider adding a similar projection here, or clarifying the writer contract.

jerry-024 · 2026-05-26T11:46:05Z

+                GlobalIndexMultiColumnWriter multiWriter =
+                        (GlobalIndexMultiColumnWriter) indexWriter;
+                rows.forEachRemaining(
+                        row -> {


Same _ROW_ID issue as the Flink side: rows come from a reader using readType = indexColumns + _ROW_ID, but the multi-column writer expects only index columns. The row passed to multiWriter.write(row) includes the extra _ROW_ID field.

jerry-024 · 2026-05-26T11:46:05Z

@@ -97,7 +106,7 @@ public String[] call(
                BTreeIndexTopoBuilder.buildIndexAndExecute(
                        procedureContext.getExecutionEnvironment(),
                        table,


BTree silently drops extra columns: when a user passes "col1,col2" with index type btree, only indexColumns.get(0) is used — no error, no warning. Consider adding a validation:

if ("btree".equalsIgnoreCase(indexType)) { checkArgument(indexColumns.size() == 1, "BTree index only supports single column, got: %s", indexColumns); }

jerry-024 · 2026-05-26T11:46:05Z

+        }
+        return minRowId;
+    }
+


Minor: the old filterEntriesBefore in GenericIndexTopoBuilder had a LOG.info("Filtered {} files ...") line for observability. This was lost during extraction since GlobalIndexBuilderUtils has no logger. Consider adding one — this log is useful for debugging index build issues in production.

add log done

jerry-024

Found additional high-risk issues in code paths NOT modified by this PR but broken by the introduction of MULTI_COLUMN_INDEX_FIELD_ID = -1:

jerry-024 · 2026-05-27T01:31:46Z

 import java.util.ArrayList;
+import java.util.HashMap;
 import java.util.List;
+import java.util.Map;


High risk — MERGE path crash: MULTI_COLUMN_INDEX_FIELD_ID = -1 breaks existing code that calls rowType.getField(globalIndexMeta.indexFieldId()) without guarding against -1:

MergeIntoUpdateChecker.java:104 (Flink): scans index manifest entries and does rowType.getField(globalIndexMeta.indexFieldId()) — will throw when encountering a multi-column index.

MergeIntoPaimonDataEvolutionTable.scala:514 (Spark): same pattern — rowType.getField(globalIndexMeta.indexFieldId()).name().

Once a table has a multi-column global index, any MERGE INTO that touches indexed columns will crash with "Cannot find field by field id: -1".

Fix: these callers need to handle MULTI_COLUMN_INDEX_FIELD_ID by reading extraFieldIds() to get the actual column list.

Fix:

Added getIndexedFieldNames helper in both Flink and Spark paths:

When indexFieldId == MULTI_COLUMN_INDEX_FIELD_ID (-1): resolve column names from extraFieldIds()

Otherwise: use the original single-column logic (rowType.getField(indexFieldId) + optional extraFieldIds)

Both the index filter (which entries are affected) and the error reporting (conflicted column names) now correctly handle multi-column indexes.

Affected files:

paimon-flink/.../dataevolution/MergeIntoUpdateChecker.java

paimon-spark/paimon-spark-common/.../MergeIntoPaimonDataEvolutionTable.scala

paimon-spark/paimon-spark-4.0/.../MergeIntoPaimonDataEvolutionTable.scala

jerry-024 · 2026-05-27T01:31:46Z

+                            if (textColumn.id() == id) {
+                                return true;
+                            }
+                        }


High risk — read path mismatch: FullTextScanImpl now correctly selects multi-column index files via the extraFieldIds check added in this PR. However, the corresponding read path in FullTextReadImpl.java:72-74 still creates the reader with:

GlobalIndexerFactoryUtils.load(indexType).create(textColumn, options)

This uses the single-column factory method. When the selected index file was built with create(List<DataField>[text, vector], options), reading it with a single-column reader will either fail to decode the file or produce incorrect results.

The scan/filter path and the read path are now inconsistent — scan discovers multi-column files, but read doesn't know how to open them. Need to use resolveFields-style metadata lookup + create(List<DataField>, Options) in the read path as well.

Before creating the GlobalIndexer, inspect the first index file's GlobalIndexMeta to determine the file's column layout:

indexFieldId == -1 (multi-column): resolve all field IDs from extraFieldIds(), create reader with factory.create(List, options)

Otherwise (single-column): use original factory.create(textColumn, options)

This ensures the reader matches the format used when the index file was built.

Affected file:

paimon-core/src/main/java/org/apache/paimon/table/source/FullTextReadImpl.java

Same approach as FullTextReadImpl — inspect the first index file's GlobalIndexMeta before creating the reader:

indexFieldId == -1 (multi-column): resolve fields from extraFieldIds(), use factory.create(List, options)

Otherwise (single-column): use original factory.create(vectorColumn, options)

Affected file:

paimon-core/src/main/java/org/apache/paimon/table/source/VectorReadImpl.java

jerry-024 · 2026-05-27T01:31:46Z

+                        for (int id : globalIndex.extraFieldIds()) {
+                            if (vectorColumn.id() == id || filterFieldIds.contains(id)) {
+                                return true;
+                            }


Same read path issue as FullText: VectorReadImpl.java:87-89 creates the reader with create(vectorColumn, options) (single-column), but this scan filter now discovers multi-column index files. The read side needs the same multi-column awareness.

same as last comment,fixed

jerry-024 · 2026-05-27T01:31:46Z

+                            } else {
+                                Object fieldData = indexFieldGetters[0].getFieldOrNull(row);
+                                if (fieldData == null) {
+                                    LOG.info(


Still unresolved from last review: multi-column writer receives the full projected row including _ROW_ID. Need to project down to index-only columns before write(row). See ProjectedRow approach.

commit no push，now already push

jerry-024 · 2026-05-27T01:31:46Z

@@ -97,7 +106,7 @@ public String[] call(
                BTreeIndexTopoBuilder.buildIndexAndExecute(
                        procedureContext.getExecutionEnvironment(),
                        table,


Still unresolved: BTree and other index types that don't support multi-column will silently drop extra columns. When indexColumns.size() > 1 and the factory doesn't override create(List<DataField>, Options), the default implementation falls back to fields.get(0) without any error.

Suggest: throw UnsupportedOperationException in the default GlobalIndexerFactory.create(List<DataField>, Options) when fields.size() > 1, instead of silently falling back to create(fields.get(0), options). This forces implementations to explicitly opt-in to multi-column support.

…hod extraction

…n, and restore observability logs

… index (indexFieldId=-1)

…-column for unsupported index types

…k, and multi-column guard

…count is unlimited

jerry-024 · 2026-05-27T08:08:17Z

+1

Resolve conflict in GenericIndexTopoBuilder: keep multi-column write logic, unify variable name to rowsSeen.

CrownChu force-pushed the feature-globalindex-support-multi branch 2 times, most recently from 394d600 to 682e613 Compare May 22, 2026 03:35

JingsongLi reviewed May 23, 2026

View reviewed changes

CrownChu added 2 commits May 25, 2026 20:38

[globalindex] Support multi-column in CreateGlobalIndexProcedure

c9d47d0

Allow index_column parameter to accept comma-separated column names (e.g. "title,embedding") for both Flink and Spark procedures. Add List<String> overload for GenericIndexTopoBuilder.buildIndexAndExecute.

CrownChu force-pushed the feature-globalindex-support-multi branch 2 times, most recently from cab2375 to 52a3cb5 Compare May 25, 2026 12:46

jerry-024 reviewed May 26, 2026

View reviewed changes

CrownChu added 2 commits May 26, 2026 18:36

[globalindex] Fix multi-column index metadata storage and resolveFiel…

8c4d5f2

…ds validation

[globalindex] Fix GenericIndexTopoBuilder multi-column null value error

a28f05f

CrownChu force-pushed the feature-globalindex-support-multi branch from 136209d to a28f05f Compare May 26, 2026 10:38

[globalindex] Extract findMinNonIndexableRowId and filterEntriesBefor…

0cfc7ef

…e into GlobalIndexBuilderUtils

jerry-024 reviewed May 26, 2026

View reviewed changes

jerry-024 reviewed May 27, 2026

View reviewed changes

CrownChu added 2 commits May 27, 2026 11:04

[globalindex] Fix test to reference GlobalIndexBuilderUtils after met…

a9a731d

…hod extraction

[globalindex] Fix multi-column writer projection, add BTree validatio…

55a445a

…n, and restore observability logs

CrownChu force-pushed the feature-globalindex-support-multi branch from d7d7b42 to 55a445a Compare May 27, 2026 03:05

CrownChu added 4 commits May 27, 2026 11:17

[globalindex] Fix MERGE INTO crash when table has multi-column global…

cdf2605

… index (indexFieldId=-1)

[globalindex] Fix FullText/Vector read path mismatch and reject multi…

1cb4d24

…-column for unsupported index types

[globalindex] Add input validation, Spark schema filtering, null chec…

dc550e1

…k, and multi-column guard

[globalindex] Reject duplicate index columns and document why column …

be43a94

…count is unlimited

Merge apache/master into feature-globalindex-support-multi

0ac045a

Resolve conflict in GenericIndexTopoBuilder: keep multi-column write logic, unify variable name to rowsSeen.

Conversation

CrownChu commented May 22, 2026

Uh oh!

JingsongLi left a comment

Choose a reason for hiding this comment

Review: [globalindex] Support multi-column GlobalIndex framework

1. [Bug] Flink BuildIndexOperator: Multi-column writer receives row with extra _ROW_ID field

2. [Correctness] No null-field handling in multi-column mode (Flink)

3. [Robustness] Unsafe cast without instanceof check

4. [Design] Interface default method creates silent data-loss path

5. [Minor] resolveFields assumes metadata consistency across range groups

Summary

Uh oh!

CrownChu commented May 25, 2026

Fixes and Additions in This Round

1. [Bug] Flink BuildIndexOperator: Multi-column writer receives row with extra _ROW_ID field

2. [Correctness] No null-field handling in multi-column mode (Flink)

3. [Robustness] Unsafe cast without instanceof check

4. [Design] Interface default method creates silent data-loss path

5. [Minor] resolveFields assumes metadata consistency across range groups

6. Multi-column minimum rowId for schema evolution

7. Multi-column GlobalIndexMeta storage convention

8. New ES Topology Builder (es-multi-index)

9. Multi-condition reader predicate push-down

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jerry-024 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jerry-024 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jerry-024 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

1. [Bug] Flink `BuildIndexOperator`: Multi-column writer receives row with extra `_ROW_ID` field

3. [Robustness] Unsafe cast without `instanceof` check

5. [Minor] `resolveFields` assumes metadata consistency across range groups