Creating the ForNet Dataset
We cannot provide the ForNet dataset directly: it is too large to include in the appendix, and linking to it would violate double-blind review rules.
After acceptance, the dataset will be downloadable online.
For now, we provide the scripts and steps to recreate the dataset.
In general, if you are unsure what arguments each script allows, run it using the --help flag.
1. Setup paths
Fill in the paths in experiments/general_srun in the Model Training Code folder, as well as in srun-general.sh, slurm-segment-imnet.sh and all the sbatch-segment-... files.
In particular, set --container-image, --container-mounts, --output, and the NLTK_DATA and HF_HOME paths in --export.
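As a sketch, the relevant job-script header might look like the following (this assumes a pyxis-style slurm setup; every path is a placeholder you must adapt):

```shell
#SBATCH --container-image=/path/to/your/container.sqsh
#SBATCH --container-mounts=/path/to/data:/data,/path/to/code:/code
#SBATCH --output=/path/to/logs/%j.out
#SBATCH --export=ALL,NLTK_DATA=/path/to/nltk_data,HF_HOME=/path/to/hf_cache
```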
2. Pretrain Filtering Models
Use the Model Training Code to pretrain an ensemble of models to use for filtering in a later step.
Train those models on either TinyImageNet or ImageNet, depending on whether you want to create TinyForNet or ForNet.
Then fill in the relevant paths to the pretrained weights in experiments/filter_segmentation_versions.py, lines 96/98.
3. Create the dataset
Automatically: using slurm
You can simply run the create_dataset.py script (on a slurm head node); it runs all the necessary steps one after another.
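A minimal sketch of the invocation (check --help for the available options):

```shell
# From the repository root, on a slurm head node:
python create_dataset.py --help   # inspect the available options first
python create_dataset.py          # then launch the full pipeline
```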
Manually and step-by-step
If you want to run each step of the pipeline manually, follow these steps.
For default arguments and settings, see the create_dataset.py script: even if you do not run it directly, it shows how to invoke all the other scripts.
3.1 Segment Objects and Backgrounds
Use the segmentation script (segment_imagenet.py) to segment each of the dataset images.
Note that this script uses datadings for image loading, so you need to provide a datadings variant of your dataset.
You need to provide the root folder of the dataset.
Choose your segmentation model using the -model argument (LaMa or AttErase).
If you want to use the "general" prompting strategy, set the --parent_in_promt flag.
Use --output/-o to set the output directory.
Use --processes and -id to split the task across multiple parallel processes.
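Putting these arguments together, an invocation might look like the following (a sketch only: the paths are placeholders, the dataset root is assumed to be positional, and eight parallel processes is just an example):

```shell
# Process shard 0 of 8; launch one such job for each id in 0..7.
python segment_imagenet.py /path/to/imagenet_datadings \
    -model LaMa \
    --parent_in_promt \
    -o /path/to/segmented_output \
    --processes 8 -id 0
```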
3.2 Filter the segmented images
In this step, you use the pretrained ensemble of models (from step 2) for filtering the segmented images.
As this step is based on the training and model code, it's in the Model Training Code directory.
After setting the relevant paths to the pretrained weights (see step 2), you may run the experiments/filter_segmentation_versions.py script using that directory as the PWD.
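A sketch of the invocation (directory name as in this appendix; adjust to your checkout):

```shell
cd "Model Training Code"
python experiments/filter_segmentation_versions.py
```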
3.3 Zip the dataset
In distributed storage settings it might be useful to read from one large (uncompressed) zip file instead of reading millions of small individual files. To do this, run
zip -r -0 backgrounds_train.zip train/backgrounds > /dev/null 2>&1
for both the train and val backgrounds and foregrounds.
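Assuming the folder layout mirrors the example above (the archive and folder names for the other splits are assumptions), the four archives can be created like this:

```shell
# -0 stores files without compression, keeping random access cheap.
zip -r -0 backgrounds_train.zip train/backgrounds > /dev/null 2>&1
zip -r -0 foregrounds_train.zip train/foregrounds > /dev/null 2>&1
zip -r -0 backgrounds_val.zip   val/backgrounds   > /dev/null 2>&1
zip -r -0 foregrounds_val.zip   val/foregrounds   > /dev/null 2>&1
```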
3.4 Compute the foreground size ratios
For the resizing step during recombination, the relative size of each object in each image is needed.
To compute it, run the foreground_size_ratio.py script on your filtered dataset.
It expects the zip files (from step 3.3) in the folder you provide via -ds.
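For example (a sketch; the dataset path is a placeholder):

```shell
python foreground_size_ratio.py -ds /path/to/filtered_dataset
```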