CIS 731 | Project : Week 2: Pipeline Repair & Bug Resolution
March 28, 2026
What I worked on
- Moved from high-level observation to a full line-by-line audit of the codebase across all four generations (Gen 0-Gen 3)
- Resolved all 7 bugs in order of severity before running any new training:
-
Tensor shape transposition :
prepare_data.pywas saving tensors as (time × channels) but Conv1d expects (channels×time).- Fixed by transposing before saving and removing the fragile shape-heuristic workaround from
data.py. - All
.ptfiles regenerated.
-
Embedding collapse (loss stuck at 0.5) :
- Gen 3 training flatlined at loss=0.5, meaning the model produced random embeddings unable to separate any subjects.
- Resolved by adding learning rate warmup for the Transformer and fixing corrupted inputs caused by the transpose bug.
-
Evaluation script importing dead Gen 0 code :
evaluate_metrics.pywas importingSiameseFusionfrom Gen 0, making it incompatible with Gen 3’sHybridGaitTransformer. Rewrote the evaluation script to correctly interface with Gen 3.
-
Checkpoint signatures not saved :
- Gen 3 checkpoints were missing
model_signatureandtraining_signaturekeys, causing resume-from-checkpoint to always fail. Fixed by adding both keys at save time.
- Gen 3 checkpoints were missing
-
NaN values from Butterworth filter :
- The high-pass filter was introducing NaN values on near-zero or signal-dropout segments. Added pre-filtering validation
-
Gen 2 cache bug :
generate_cachein Gen 2 had no transpose workaround, causing all files to be skipped as too short. Fixed by aligning cache logic with the corrected tensor orientation.
-
Insufficient batch diversity :
batch_size=64with 8 samples per class yielded only 8 subjects per batch, insufficient for stable BatchHard triplet mining.
-
What I learned
- A single tensor shape mismatch can silently corrupt an entire training run without throwing an error the model trains, loss is logged, but nothing meaningful is being learned.
- Embedding collapse (loss = 0.5) is not always a model architecture problem; it can be entirely caused by bad data flowing in upstream.
- Fixing bugs in the right order matters : the tensor shape fix was a prerequisite for diagnosing whether the embedding collapse was a data issue or a model issue.
Challenges
- Several bugs were silent, they didn’t crash the pipeline, they just produced wrong results. This made them harder to catch than explicit errors.
- Regenerating all
.ptfiles after the tensor fix was time-consuming but necessary to ensure no corrupted cached data persisted.
Next week
- Verify Gen 0 and Gen 1 baselines reproduce correctly on the cleaned pipeline.
- Begin Gen 3 training runs and confirm loss no longer collapses.