[RNE Rewrite] Extend core functionality: tensor views, dtypes#1293
[RNE Rewrite] Extend core functionality: tensor views, dtypes#1293IgorSwat wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Long one, but please read it carefully. Regarding the whole PR, here is my take on that:
-
Addition of new types namely bool and int64 is definitely needed and we should keep it.
-
Regarding the view, I think that eventually we should also keep it, but this one needs big design changes. The object you call "a view" is not what view usually means. You cannot reshape that as in NumPy or Torch. It has arbitrary offset and shape in comparison to parent, actual tensor. It's literally: "reinterpret a contiguous byte sub-range as a new tensor". So it should be rather call it
sliceorsubTensor. Another thing is that offset is in bytes and shape is in elements number. You should keep the offset in number of elements, otherwise you:- User has to compute
N * dtype.size()in JS - strange, mixing concepts, - It's unidiomatic (NumPy and Torch offset are in elements),
- A non-modulo size of element offset in bytes causes misaligned ptr -> undefined behaviour.
The third problem: This codebase cares about thread safety. Bartek designed this carefully to make it possible, namely
copyTo/setData/getData/disposedoestry_to_lockonmutex_. But eachTensorHostObjecthas its ownmutex_, and a view is a separate object pointing to the parent's bytes. Soparent.setData()(locksparent.mutex_) andview.getData()(locksview.mutex_) can run concurrently on the same memory — each happily holding its own lock. The "one mutex guards one memory region" rule is broken the moment a view exists. That's a data race the current single-tensor code can't have. You should rather:- Hold a
shared_ptrto the parent in the view → fixes the use-after-free and makes the "remember to dispose the parent last" comment unnecessary (ownership enforces it), - Share the parent's mutex_ (the view should lock the same mutex) → fixes the aliasing race.
Yet another thing, you should add the check:
offset_elems + numel <= parent.numel→ fixes OOB and the alignment hazard. - User has to compute
-
Regarding the
DTyperewrite (free functions → class): I'd drop this one. As mentioned, the newbool/int64values are needed, but those don't require the class at all. You get 100% of thedtypebenefit with none of the class's churn. The class itself doesn't pull its weight, and I think it's actually a small step back:- It wraps a single
uint8_tand protects no invariant the validated free functions didn't already guarantee. The comment "leaves space for additional logic if needed" is speculative — YAGNI. - The implicit conversions are footguns, especially the string pair. With both
DType(const std::string&)andoperator std::string()implicit,dtype == "float32"now compiles — it resolves viaDType→std::stringand does a heap-allocating string compare instead of an enum compare, silently bypassing the type system. On top of that,operator std::string()andoperator ScalarType()can throw, so an innocuousstd::string s = dtype; carries a throw. dtype.size()is ambiguous sitting next toTensorView::size_(total tensor bytes) andnumel_.elementSize(dtype)said exactly what it returned. Same for construction:parseDType(s)/fromScalarType(s)documented "parses, may throw" / "maps from aten" at the call site;DType(x)hides that behind a constructor.
So: keep the new enum values, drop the class. If we ever do want a wrapper later, it should at minimum be explicit on the string boundaries — but I don't think we need it now.
- It wraps a single
-
We cannot just disable the input / output checks for all models and just merge it. W need to figure out some solution for this one before progressing further. One way is to reexport models with meta methods that returns dynamic shapes of the model. This way we can reimplement our checks to work with models such as Kokoro.
-
If we do not enforce linebreaks in clang-format, maybe we should increase the suggested max signs in a row to a higher value than 60, so we do not imply so many incorrectly formatted lines as here in
tensor.cpp, cc: @barhanc -
I know that these warnings raise by
readability-implicit-bool-conversionare preexisting, but we should fix them.
|
I think @msluszniak's analysis is spot on. My inherent reluctance to adding views is exactly because it makes reasoning about tensor lifetimes and concurrency difficult -- both of which are the most fragile aspects of the new architecture implementation and must be taken with utmost care, since if they aren't ironclad everything else can fall apart. At the same time I agree that it is something we actually might want to add to
Ideally it would be possible to add views as purely additive change that wouldn't require changing any of the validation code in cpp ops outside Regarding
Do you mean the lines already present in the code or the whitespace changes introduced in this PR? The whitespace changes shouldn't be there and they are probably because the files were first formatted with columnWidth set to something and then after rebase introduced our column-width-lenient .clang-format config the formatter didn't change them back. They should be manually reverted.
These are not preexisting (the clang-tidy is clean when run on the current rne-rewrite branch). Before, the |
Right, so let's just reformat the code.
Oh, forget that it was smart pointer which has an explicit bool operator. |
Description
This PR introduces the following changes to the core of the package:
Introduces a breaking change?
Type of change
Tested on
Testing instructions
Run existing demo apps / tests (if there are any).
Screenshots
Related issues
Checklist
Additional notes
Some of the changes are there because of the followup commit with text-to-speech pipeline implementation.