Adding support for .jpk-qi-data and .bin files and moving to dataclasses for the returned loaded data#190
Adding support for .jpk-qi-data and .bin files and moving to dataclasses for the returned loaded data#190ahobbs7 wants to merge 58 commits into
Conversation
…igger point or by contact point
… a function to general loader
…make the save an adjustable option
… once, then extracting separately
…files and removing the ability to run curve analysis directly in the reader
…ore robust coordinate access
…o eagerly load lazy loaded data
…me efficient method
…g to memory then saving in one go when duplicating data to h5
…sampled curves then are streamed directly into h5
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #190 +/- ##
===========================================
- Coverage 79.84% 53.12% -26.73%
===========================================
Files 12 15 +3
Lines 928 1775 +847
===========================================
+ Hits 741 943 +202
- Misses 187 832 +645 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
SylviaWhittle
left a comment
There was a problem hiding this comment.
Small partial review.
I'd like to call with you to talk about the interesting lazy loading of data if you have time?
| return frames, pixel_to_nanometre_scaling_factor, header_dict | ||
|
|
||
|
|
||
| def get_asd_channels(file_path: Path): |
There was a problem hiding this comment.
I don't have time to check this manually - does it work? There isn't a test but frankly we don't have the dev time to move slowly. If you say it works, this is fine with me.
There was a problem hiding this comment.
The channel fetching seems to work though I'm happy to quickly make some tests for them as that should be pretty quick.
| elif len(h5_returned) == 4: | ||
| image, pixel_to_nanometre_scaling_factor, _, curve_data = h5_returned # type: ignore[misc] | ||
| self.loaded_curves = True | ||
| print( |
There was a problem hiding this comment.
| print( | |
| logger.info( |
or just remove it.
There was a problem hiding this comment.
Yes, I'll just remove them, they are some accidentally left in prints for debugging.
| f"Loaded image with shape {image.shape} and pixel to nanometre " | ||
| f"scaling factor {pixel_to_nanometre_scaling_factor}" | ||
| ) | ||
| print(f"Image has max value {image.max()} and min value {image.min()}") |
| image, pixel_to_nanometre_scaling_factor = spm.load_spm(self.filepath, self.channel) | ||
| elif self.suffix == ".h5-jpk": | ||
| image, pixel_to_nanometre_scaling_factor, _ = h5_jpk.load_h5jpk(self.filepath, self.channel) | ||
| h5_returned = h5_jpk.load_h5jpk(self.filepath, self.channel, load_curves=not self.loaded_curves) |
There was a problem hiding this comment.
Could this entire block be absorbed into the h5_jpk.load_h5jpk function call? It's a little messy here.
I'll have a go too.
There was a problem hiding this comment.
Yes, is definitely a bit messy. My thinking was preventing the need to make changes to topostats by not changing what each load function returns if it isn't reading the curve data I'm adding support for. Though I think h5jpk is not used in TopoStats anyway? And it might be necessary to make changes to topostats reading anyway due to the changes I have been working on for returning the z unit for the read files?
| raise e | ||
|
|
||
| # scope for a "check what channels are available" function similar to above. | ||
| def get_available_channels(self): # noqa: C901 |
There was a problem hiding this comment.
Guessing this is for the napari feature to list the channels?
Well done if this is all working, that's a lot of work.
| A proxy object for the specified row. | ||
| """ | ||
|
|
||
| class RowProxy: |
There was a problem hiding this comment.
Is the reason RowProxy is defined within the scope of LazyQiData encapsulation? It'll get redefined each time __getitem__ runs, though this is likely a negligible and unimportant cost.
| self.parent = parent | ||
| self.y = y | ||
|
|
||
| def __getitem__(self, x: int): |
There was a problem hiding this comment.
I'm going to need a bit of an explanation on how this is intended to work - could we set up a call to talk about it? There's a lot going on here. Plus, frankly I'm just curious.
| # Set the format to have blue time, green file, module, function and line, and white message | ||
| logger.add( | ||
| sys.stderr, | ||
| lambda msg: sys.stderr.write(msg), # pylint: disable=unnecessary-lambda |
There was a problem hiding this comment.
What necessitated this change?
There was a problem hiding this comment.
It seems like the general_loader and some of the spm tests which required were failing due to not being able to correctly reading the logged output (instead always reading an empty string), they seem to be failing when I run even the main branch locally for that reason. The lamda appears to be necessary due to a quirk of loguru where lamda makes sys.stderr be dynamically retrieved at the time of logging rather than once at import time.
| return final_multiplier, final_offset, unit | ||
|
|
||
|
|
||
| class jpk_qi_loader: |
There was a problem hiding this comment.
Capitalise this please :)
| # Open the ZIP archive once and keep it open for the duration of the loading process | ||
| self.qi_archive = zipfile.ZipFile(self.filepath, "r") # pylint: disable=consider-using-with | ||
| logger.info(f"Opened JPK QI archive at {self.filepath}") | ||
| self.namelist = self.qi_archive.namelist() |
There was a problem hiding this comment.
Maybe call this list_of_all_paths or something - since namelist could be anything? Just so future people (including ourselves) can tell at a glance what this is.
Maybe also add a comment above as to what it's used for? :)
| self.qi_archive = zipfile.ZipFile(self.filepath, "r") # pylint: disable=consider-using-with | ||
| logger.info(f"Opened JPK QI archive at {self.filepath}") | ||
| self.namelist = self.qi_archive.namelist() | ||
| # Set path to the .jpk-qi-image file within the archive for later use |
There was a problem hiding this comment.
| # Set path to the .jpk-qi-image file within the archive for later use | |
| # For holding the reference to where the actual .jqk-qi image is (not the metadata). |
| if self.save_as_h5: | ||
| self.save_lite_data() | ||
|
|
||
| return (self.image, self.px2nm, (self.curve_data, self.channels_units, self.full_metadata)) |
There was a problem hiding this comment.
Possibly bundle these objects into one metadata dataclass, JPKQiCurveData? Gentle suggestion.
…to a numpy ndarray and grouping into unified CurveDataset structure
…actored dataclasses
…adata to align with potential differences in channels between volumes
| """ | ||
| self.shape_x = shape_x | ||
| self.shape_y = shape_y | ||
| self.dims = (shape_y, shape_x) |
There was a problem hiding this comment.
Should this be shape instead of dims?
| int | ||
| The total number of pixels in the image. | ||
| """ | ||
| return self.shape_x * self.shape_y |
There was a problem hiding this comment.
Perhaps the shape returned should be (x, y) rather than the absolute number of entries, since they are indexed as (x, y) rather than just the "flat" index?
| curve_num = y * self.shape_x + x | ||
| curve_data: dict[str, Any] = {} | ||
|
|
||
| for chan_name, scale in self.channel_scaling.items(): |
There was a problem hiding this comment.
Maybe for these short names, use the full name, ie "channel_name"? Just in case it could be misinterpreted.
| cumulative_multiplier = 1.0 | ||
| cumulative_offset = 0.0 | ||
| unit = props.get(f"{prefix}conversion-set.conversion.{current_slot}.scaling.unit.unit") | ||
|
|
There was a problem hiding this comment.
A comment about what slot is would be good here. :)
| # For holding the reference to where the actual .jqk-qi image is (not the metadata). | ||
| self.path_to_image = None | ||
|
|
||
| # Chunk size for H5 datasets |
There was a problem hiding this comment.
A brief explanation on why chunks matter, for future devs here?
| i += 1 | ||
|
|
||
| logger.info(f"Loading JPK QI data from {self.filepath} with channel {self.channel}") | ||
| self.extract_global_metadata() |
There was a problem hiding this comment.
Make this return the value and set it here instead? Just a little clearer - feel free to override this suggestion.
There was a problem hiding this comment.
Just noticed that the extract global metadata also defines some other instance attributes as well as the main global metadata attribute so I think going to leave as is to keep clean
| # If there are no failed loads, log that all data was loaded successfully | ||
| logger.info("Successfully loaded all curve data without any missing files.") | ||
|
|
||
| def extract_data_to_h5( |
| """ | ||
| Predict the total number of points for each channel and segment. | ||
|
|
||
| This is done by sampling a subset of curves and extrapolating based on the maximum number |
There was a problem hiding this comment.
Perhaps add that if this prediction is wrong, how the curve loading for the excess curve data will be slow due to resizing?
| flip_image=bool(flip_image), | ||
| ) | ||
|
|
||
| def save_lite_data(self): |
There was a problem hiding this comment.
Mention without curves?
There was a problem hiding this comment.
Renamed to extract_and_save_per_curve_data
| h5_datasets_buffer[seg_name][chan["name"]] = {"Data": [], "Indices": []} | ||
| return global_meta_group, h5_datasets, h5_meta_datasets, h5_datasets_buffer, h5_meta_datasets_buffer | ||
|
|
||
| def get_saving_context(self): |
There was a problem hiding this comment.
Stale - been changed on another branch - update?
| # Lookup map for binary scaling | ||
| self.channel_scaling = {chan["name"]: chan for chan in self.segment_channels} | ||
|
|
||
| def close(self): |
Adds support for .jpk-qi-data files, allowing the loading of the image for the selected channel. It also allows for the raw force-distance curve data as well as metadata for each curve to be loaded efficiently, using a lazy data structure approach that loads each section of the dataset into memory only when it is required.
The updates also allow for the conversion of .jpk-qi-data files into a HDF5 file format (.h5-jpk) for much faster mass analysis. Hence, the h5-jpk loading code has also been updated to support the loading of these converted files.
The update also allows .bin with different binary formats to be loaded.
closes #174 closes #173