# Creating the ForNet Dataset

We cannot provide the ForNet dataset directly: it is too large to include in the appendix, and linking to it would violate double-blind review rules.
After acceptance, the dataset will be downloadable online.
For now, we provide the scripts and steps to recreate the dataset.
In general, if you are unsure what arguments each script allows, run it using the `--help` flag.

## 1. Set up paths

Fill in the paths in `experiments/general_srun` in the `Model Training Code` folder, as well as in `srun-general.sh`, `slurm-segment-imnet.sh` and all the `sbatch-segment-...` files.
In particular, set the `--container-image`, `--container-mounts` and `--output` paths, as well as the `NLTK_DATA` and `HF_HOME` paths in `--export`.

## 2. Pretrain Filtering Models

Use the `Model Training Code` to pretrain an ensemble of models to use for filtering in a later step.
Train those models on either `TinyImageNet` or `ImageNet`, depending on whether you want to create `TinyForNet` or `ForNet`.
Then fill in the relevant paths to the pretrained weights in `experiments/filter_segmentation_versions.py` lines 96/98.

## 3. Create the dataset

### Automatically: using slurm

You can simply run the `create_dataset.py` script on a slurm head node. It automatically runs all the necessary steps one after another.

### Manually and step-by-step

If you want to run each step of the pipeline manually, follow these steps.
For default arguments and settings, see the `create_dataset.py` script. Even if you do not want to run it directly, it shows how to run all the other scripts.

#### 3.1 Segment Objects and Backgrounds

Use the segmentation script (`segment_imagenet.py`) to segment each of the dataset images.
Note that this script uses `datadings` for image loading, so you need to provide a `datadings` variant of your dataset.
You need to provide the root folder of the dataset.
Choose your segmentation model using the `-model` argument (LaMa or AttErase).
If you want to use the *general* prompting strategy, set the `--parent_in_promt` flag.
Use `--output`/`-o` to set the output directory.
Use `--processes` and `-id` for splitting the task up into multiple parallelizable processes.
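
To illustrate the idea behind splitting the work, here is a minimal Python sketch of how one task list can be divided among several independent workers. The function `shard` and the strided assignment are assumptions made for illustration only; how `segment_imagenet.py` actually partitions the images between process ids is not described here and may differ.

```python
def shard(items, num_processes, process_id):
    """Return the part of `items` that worker `process_id` should handle.

    Hypothetical helper: the strided split is an assumption, not the
    partitioning used by the segmentation script.
    """
    if not 0 <= process_id < num_processes:
        raise ValueError("process_id must be in [0, num_processes)")
    # Worker k takes items k, k + n, k + 2n, ... for n = num_processes.
    return items[process_id::num_processes]


image_ids = [f"img_{i:03d}" for i in range(10)]
parts = [shard(image_ids, 4, k) for k in range(4)]

# Every image is assigned to exactly one worker.
assert sorted(sum(parts, [])) == image_ids
```

With a scheme like this, each worker only needs to know the total number of processes and its own id, so the workers can run as separate jobs without communicating.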

#### 3.2 Filter the segmented images

In this step, you use the pretrained ensemble of models (from step 2) for filtering the segmented images.
As this step is based on the training and model code, it's in the `Model Training Code` directory.
After setting the relevant paths to the pretrained weights (see step 2), run the `experiments/filter_segmentation_versions.py` script with the `Model Training Code` directory as the working directory.
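
The working directory matters because the script is addressed relative to the `Model Training Code` directory. A minimal sketch of launching it from Python could look as follows. The helpers `filter_command` and `run_filter` are hypothetical, and the sketch assumes the script can be started with a plain Python interpreter; in a slurm or container setup you would wrap the command accordingly.

```python
import subprocess
import sys
from pathlib import Path


def filter_command(training_code_dir):
    """Build the command and working directory for the filtering step."""
    cwd = Path(training_code_dir)
    # The script path is relative to the working directory, not to the caller.
    cmd = [sys.executable, "experiments/filter_segmentation_versions.py"]
    return cmd, cwd


def run_filter(training_code_dir, extra_args=()):
    """Run the filtering script from inside the training code directory."""
    cmd, cwd = filter_command(training_code_dir)
    # check=True raises CalledProcessError if the script exits with an error.
    return subprocess.run(cmd + list(extra_args), cwd=cwd, check=True)
```

For example, `run_filter("Model Training Code", ["--help"])` would list the arguments the script accepts.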

#### 3.3 Zip the dataset

In distributed storage settings, it might be useful to read from one large (uncompressed) zip file instead of reading millions of small individual files.
To do this, run

```commandline
zip -r -0 backgrounds_train.zip train/backgrounds > /dev/null 2>&1
```

Run this for each of the train and val backgrounds and foregrounds, adjusting the archive name and source folder accordingly.
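
If the `zip` command line tool is not available, the same kind of archive can be produced with Python's standard `zipfile` module. The following is a minimal sketch, not part of the provided scripts; the function names `zip_stored` and `read_member` are hypothetical. `ZIP_STORED` corresponds to the `-0` option above, meaning that files are stored without compression.

```python
import zipfile
from pathlib import Path


def zip_stored(root, rel_dir, zip_path):
    """Store all files under `root/rel_dir` in `zip_path` without compression.

    Entry names are relative to `root`, so root="." and
    rel_dir="train/backgrounds" yields entries like "train/backgrounds/...".
    Unlike `zip -r`, this sketch adds only files, not directory entries.
    """
    root = Path(root)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted((root / rel_dir).rglob("*")):
            if path.is_file():
                zf.write(path, arcname=path.relative_to(root).as_posix())
    return zip_path


def read_member(zip_path, member):
    """Read the raw bytes of a single file from the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.read(member)
```

Calling `zip_stored(".", "train/backgrounds", "backgrounds_train.zip")` mirrors the command above. Because the entries are stored uncompressed, reading a single image back with `read_member` does not require decompression.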

#### 3.4 Compute the foreground size ratios

For the resizing step during recombination, the relative size of each object in each image is needed.
To compute it, run the `foreground_size_ratio.py` script on your filtered dataset.
It expects the zip files to be in the folder you provide as `-ds`.
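
To make the notion of a relative object size concrete, here is a small Python sketch of one plausible definition: the fraction of image pixels that belong to the foreground. This definition, the function name `foreground_size_ratio` and the representation of the mask as rows of alpha values are all assumptions made for illustration; the exact quantity computed by `foreground_size_ratio.py` may differ.

```python
def foreground_size_ratio(alpha_mask):
    """Fraction of pixels in `alpha_mask` that are not fully transparent.

    `alpha_mask` is a list of rows of alpha values (0 = background).
    Assumed definition for illustration, not taken from the script.
    """
    total = sum(len(row) for row in alpha_mask)
    if total == 0:
        raise ValueError("mask is empty")
    foreground = sum(1 for row in alpha_mask for alpha in row if alpha > 0)
    return foreground / total


mask = [
    [0, 0, 0, 0],
    [0, 255, 255, 0],
    [0, 255, 255, 0],
    [0, 0, 0, 0],
]
print(foreground_size_ratio(mask))  # 4 of 16 pixels are foreground: 0.25
```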