Issue
During training it takes ages to load one batch of data. What can cause this problem? I am new to PyTorch; I had been working with TensorFlow for a while, and this is my first attempt to create something like this. I wrote a custom dataset which gets its images from folders; they are stored in a dataframe which is then split into train/val sets.
class CustomDataset(torch.utils.data.Dataset):
    """Image dataset built by walking a root folder whose sub-folder
    names are the class labels.

    The folder tree is flattened into a DataFrame with columns
    ``x`` (absolute file path) and ``y`` (label), sorted by path, and
    sliced into train/val/test partitions according to the ratios.
    """

    def __init__(self, root, split, train_ratio=0.85, val_ratio=0.1, transform=None):
        """
        Args:
            root: directory whose immediate sub-folders name the classes.
            split: one of 'train', 'val' or 'test'.
            train_ratio: fraction of the sorted data used for training.
            val_ratio: fraction used for validation; the remainder is test.
            transform: optional callable applied to each loaded image.
        """
        self.root = root
        self.train_ratio = train_ratio
        self.val_ratio = val_ratio
        self.test_ratio = 1 - (self.train_ratio + self.val_ratio)
        # split must be set before splitDf(), which reads it.
        self.split = split
        self.data = self.splitDf(self.folder2pandas())
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # .iloc[idx] returns a Series, so columns can be read by name
        # instead of the original positional row.values[0][i] access.
        row = self.data.iloc[idx]
        x = cv2.imread(row["x"])
        y = row["y"]
        # BUG FIX: the original tested self.sourceTransform, an attribute
        # that is never assigned (__init__ stores self.transform), so the
        # first __getitem__ call raised AttributeError.
        if self.transform:
            x = self.transform(x)
        # NOTE(review): cv2.imread yields a BGR numpy array; torchvision
        # transforms such as Resize expect a PIL image or tensor — verify
        # the transform pipeline accepts ndarrays.
        return x, y

    def folder2pandas(self):
        """Walk self.root and return a DataFrame of (path, label) rows."""
        rows = []
        for folder, _subs, files in os.walk(self.root):
            # os.path.basename is portable; the original split on '\\',
            # which only works with Windows path separators.
            label = os.path.basename(folder)
            for filename in files:
                rows.append((os.path.abspath(os.path.join(folder, filename)), label))
        return pd.DataFrame(rows, columns=["x", "y"])

    def splitDf(self, df):
        """Sort df by path and return the slice selected by self.split.

        Raises:
            ValueError: if self.split is not 'train', 'val' or 'test'.
        """
        df = df.sort_values(by=["x"], ascending=True).reset_index(drop=True)
        n_train = int(self.train_ratio * len(df))
        n_val = int(self.val_ratio * len(df))
        if self.split == "train":
            return df.iloc[:n_train]
        elif self.split == "val":
            return df.iloc[n_train:n_train + n_val]
        elif self.split == "test":
            return df.iloc[n_train + n_val:]
        # Fail loudly instead of silently returning None as the original did.
        raise ValueError(f"unknown split: {self.split!r}")
Augmentations:
# Training pipeline: one randomly chosen photometric augmentation per
# sample, then a horizontal flip, tensor conversion and per-channel
# normalization.
# NOTE(review): res[0]/res[1] are presumably precomputed channel mean/std
# tensors — they are not defined in this snippet; confirm before running.
_train_augment = transforms.RandomChoice([
    transforms.RandomAutocontrast(),
    transforms.ColorJitter(brightness=0.3, contrast=0.5, saturation=0.1, hue=0.1),
    transforms.GaussianBlur(kernel_size=(5,5), sigma=(0.1, 2.0)),
    transforms.Grayscale(num_output_channels=3),
    transforms.RandomVerticalFlip(),
])
train_transforms = transforms.Compose([
    transforms.Resize((224,224)),
    _train_augment,
    transforms.RandomHorizontalFlip(0.5),
    transforms.ToTensor(),
    transforms.Normalize(res[0].numpy(), res[1].numpy()),
])
# Deterministic evaluation pipeline: resize, tensor conversion and
# normalization with the same statistics as the training set.
_val_steps = [
    transforms.Resize((224,224)),
    transforms.ToTensor(),
    transforms.Normalize(res[0].numpy(), res[1].numpy()),
]
val_transforms = transforms.Compose(_val_steps)
Initializing datasets:
In the 'resources' folder there are two folders whose names represent the labels (binary classification).
# Build train/val datasets over the same root; the split argument selects
# which slice of the sorted DataFrame each dataset sees.
train_set=CustomDataset(root="resources/",split='train', transform=train_transforms)
val_set=CustomDataset(root="resources/",split='val', transform=val_transforms)
Giving datasets to dataloader:
# NOTE(review): 'testloader' actually wraps the validation set — consider
# renaming to val_loader; shuffle=True is also unnecessary for validation.
# num_workers=4 spawns worker processes, which is slow to start on Windows.
trainloader = torch.utils.data.DataLoader(train_set, shuffle = True, batch_size=32, num_workers=4)
testloader = torch.utils.data.DataLoader(val_set, shuffle = True, batch_size=32, num_workers=4)
Solution
Putting the solution from the comments in a cleaner way:
The creation of several workers was taking a large amount of time. It seems that on Windows the creation of processes can have weird behaviour in terms of timing.
Since __getitem__() is not being called, the problem is not in data loading per se; try removing the num_workers parameter:
testloader = torch.utils.data.DataLoader(val_set, shuffle = True, batch_size=32)
Then if this works, try increasing it and check the behaviour.
Answered By - PlainRavioli
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.