
[YoLo v1] Paper Review & Implementation (PyTorch)

개발자 소신 2021. 1. 1. 21:03

Hello! This is Soshin.

 

 

YoLo trades a little accuracy for speed, pushing inference to real-time.

The model itself is also much simpler than the R-CNN series, which had to be trained in two stages.

 


# YoLo v1

 

YoLo resizes the input image to 448x448.

A convolutional network extracts a feature map,

and the FC layers predict, for every grid cell, the box coordinates, the objectness confidence, and the class.

Finally, NMS is applied: among predicted boxes that overlap heavily (high IoU with each other),

only the highest-scoring box is kept and the rest are removed.

 

The description above can be a little confusing, but

just as Faster R-CNN lays anchors over an SxS grid to generate boxes and predict objects,

here each grid cell (each of the 7x7 cells covers a 64x64 region of the 448x448 image) is the unit that decides objectness and classifies the object.

 

 


# YoLo v1 Network

The YoLo network structure is simple.

It resembles Google's GoogLeNet, and right before the FC layers it produces a 7x7x1024 feature map.

After the FC layers, reshaping to 7x7x30 gives, for each grid cell, the box offsets (relative to the cell), the objectness confidence, and the class probabilities.

 

 

Therefore, the output of the YoLo model has shape

(batch size, 7, 7, 5 * B + number of classes), which is (N, 7, 7, 30) for B = 2 boxes and 20 classes.

 


# YoLo v1 Loss

The paper uses sum-squared error (SSE) for convenience, but it is not ideal for maximizing mAP.

Problem 1: localization and classification errors are weighted equally.

Only the predictor with the highest IoU against the ground truth is made responsible for each object.

Solution 1: lambda_coord (the weight on the coordinate terms) and lambda_noobj (the weight on the no-object confidence terms)

are set to 5 and 0.5, respectively.

 

 

Going through the multi-part loss term by term based on the above (the full formula is written out just below):

Term 1: for grid cells that contain an object (1^obj_ij), sum the squared error of the (x, y) offsets.

Term 2: likewise, the squared error of the width and height (the paper uses their square roots).

Term 3: the squared confidence error where an object exists (the target confidence should be 1, i.e. the model should have predicted an object).

Term 4: the squared confidence error where no object exists.

Term 5: for cells that contain an object, the squared error of each class probability.
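
Written out as in the paper (a hat denotes the prediction, 1^obj_ij marks the responsible predictor of cell i, 1^noobj_ij its complement):

$$
\begin{aligned}
\mathcal{L} =\; & \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
+\; & \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right] \\
+\; & \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
+\; & \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj}\sum_{c \in classes}(p_i(c)-\hat{p}_i(c))^2
\end{aligned}
$$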

 

To understand this loss, you first need to understand the structure of the image's ground truth (the target).

The target likewise allocates 30 values to each cell of the 7x7 grid.

First, compute which grid cell the object's center falls into (a quick numeric check follows below).

To mark that an object exists at that cell, set the confidence c to 1.

To mark the class, take the label index (0-19 for VOC 2007) and set the value at that position to 1.
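
As a quick sanity check of the cell assignment, here is a toy example (my own numbers) that mirrors the arithmetic of the encode() function defined further below, with S = 7:

import numpy as np

S = 7
cell_size = 1.0 / S                      # each cell covers 1/7 of the normalized image
box = np.array([0.20, 0.30, 0.60, 0.70]) # x1, y1, x2, y2, normalized to [0, 1]
cxy = (box[2:] + box[:2]) / 2.0          # center = (0.40, 0.50)
wh = box[2:] - box[:2]                   # size   = (0.40, 0.40)
ij = np.ceil(cxy / cell_size) - 1.0      # cell index = (2, 3)
dxy = (cxy - ij * cell_size) / cell_size # offset inside the cell = (0.80, 0.50)
print(ij, dxy, wh)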

 

The one thing I am unsure about here is how a cell should be handled when several objects fall into the same grid cell.

(From what I found, the next version seems to handle this with one anchor per box.)

 

With this, the loss between the output and the target can be computed with the multi-part loss above.

 


# Experiment

In the paper: 135 epochs,

batch size = 64,

momentum = 0.9,

weight decay = 5e-4,

initial lr = 0.001,

with an lr scheduler that slowly raises it to 0.01 over the first epochs (a warm-up to keep the model from diverging),

then divides the lr by 10 at epochs 75 and 105.

 

Dropout is applied between the FC layers (to prevent co-adaptation).

 

In YOLO, DarkNet is first pretrained on ImageNet at 224x224, and those weights are brought into YOLO for detection training.

Applying NMS (keeping only the best box per object and suppressing the overlapping rest) was reported to add about 2-3% mAP.

 


# Pytorch 구현

# Download VOC Dataset

!mkdir train
!mkdir test
!wget http://pjreddie.com/media/files/VOCtrainval_06-Nov-2007.tar -P train/
!wget http://pjreddie.com/media/files/VOCtest_06-Nov-2007.tar -P test/
!tar -xf test/VOCtest_06-Nov-2007.tar -C test/
!tar -xf train/VOCtrainval_06-Nov-2007.tar -C train/
!rm -rf test/VOCtest_06-Nov-2007.tar

Download the VOC 2007 dataset and extract it.

 

# Download Package

!pip install xmltodict
!pip install -U albumentations

Install xmltodict, which converts XML into a Python dictionary, and albumentations for the transforms.

 

# configs
root_dir = '/content'
annot_f = './{}/VOCdevkit/VOC2007/Annotations'
image_f = './{}/VOCdevkit/VOC2007/JPEGImages/{}'

# class names exactly as they appear in the VOC XML annotations
classes = ['person', # Person
           'bird', 'cat', 'cow', 'dog', 'horse', 'sheep', # Animal
           'aeroplane', 'bicycle', 'boat', 'bus', 'car', 'motorbike', 'train', # Vehicle
           'bottle', 'chair', 'diningtable', 'pottedplant', 'sofa', 'tvmonitor' # Indoor
           ]

num_classes = len(classes)
feature_size = 7
num_bboxes = 2

Define the paths and constants used throughout the code:

the image path, the annotations path, and the class list.

 


from torch.autograd import Variable
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.optim.lr_scheduler

from torch.utils.data import Dataset, DataLoader

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

## utils
import numpy as np
import random, math, time
# only when running in a notebook
from tqdm.notebook import tqdm

## File Loader
import os, xmltodict
import os.path as pth
from PIL import Image

# Draw Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches

## Transformer
from random import sample
import albumentations as A
from albumentations.pytorch.transforms import ToTensor

# Seed
random.seed(53)

 

 

 

 

# utils
def draw_image(image_info, w=448, h=448, transforms=None):
    im = np.array(Image.open(image_f.format('train', image_info['image_id'])).convert('RGB').resize((w,h)), dtype=np.uint8)

    # Create figure and axes
    fig,ax = plt.subplots(1, figsize=(7,7))

    bb = image_info['bboxs']
    la = image_info['labels']

    if transforms:
        sample = transforms(image=im, bboxes=bb, category_ids=la)
        im = sample['image'].permute(1,2,0).numpy()
        bb = sample['bboxes']
        la = sample['category_ids']

    # Display the image
    ax.imshow(im)


    # Create a Rectangle patch
    for b, l in zip(bb, la):
        # top left (x, y) , (w, h)
        rect = patches.Rectangle((b[0]*w,b[1]*h),(b[2]-b[0])*w,(b[3]-b[1])*h,linewidth=1,edgecolor='r',facecolor='none')
        # Add the patch to the Axes
        ax.add_patch(rect)
        props = dict(boxstyle='round', facecolor='red', alpha=0.9)
        plt.text(b[0]*w, b[1]*h, classes[l], fontsize=10, color='white', bbox=props)
    plt.axis('off')
    plt.show()

 

 

A function that applies the transforms and visualizes the boxes.

Personally I find plt convenient, so I convert the outputs so they can be drawn with plt.

 

# dataset # VOC
def get_infos(annot_f=annot_f, mode='train'):
    annot_dir = annot_f.format(mode)
    result = []
    for ano in [pth.join(annot_dir, ano) for ano in os.listdir(annot_dir)]:
        f = open(ano)
        info = xmltodict.parse(f.read())['annotation']
        image_id = info['filename']
        image_size = np.asarray(tuple(map(int, info['size'].values()))[:2], np.int16)
        w, h = image_size
        box_objects = info['object']
        # xmltodict returns a single dict (not a list) when the image has only one object
        if not isinstance(box_objects, list):
            box_objects = [box_objects]
        labels = []
        bboxs = []
        for obj in box_objects:
            try:
                labels.append(classes.index(obj['name'].lower()))
                bboxs.append(tuple(map(int, obj['bndbox'].values())))
            except: pass

        # Resizing Box, Change x1 y1 x2 y2
        # albumentations (normalized box)
        bboxs = np.asarray(bboxs, dtype=np.float64)
        try:
            bboxs[:, [0,2]] /= w
            bboxs[:, [1,3]] /= h
        except: pass
        if bboxs.shape[0] or mode=='test':
            result.append({'image_id':image_id, 'image_size':image_size, 'bboxs':bboxs, 'labels':labels})

    return result
    
trval_list = get_infos()
test_list = get_infos(mode='test')

len(trval_list), len(test_list)

A function that loads the image id, boxes, and labels.

Boxes are normalized to the 0-1 range, and labels are 0-19.

YOLO does not reserve a background slot in the class part of the target, so no background label is needed.

 

# utils
def get_tv_idx(tl, k = 0.5):
    total_idx = range(tl)
    train_idx = sample(total_idx, int(tl*k))
    valid_idx = set(total_idx) - set(train_idx)
    return train_idx, list(valid_idx)

train_idx, valid_idx = get_tv_idx(len(trval_list))

trval_list = np.asarray(trval_list)
train_list = trval_list[train_idx]
valid_list = trval_list[valid_idx]

Split the data into train and validation sets, and

 

# Make Dataset
class VOCDataset(Dataset):
    def __init__(self, data_list, mode='train', transforms=None):
        self.data_list = data_list
        self.mode = mode
        self.transforms = transforms

    def __len__(self):
        return len(self.data_list)
    
    def __getitem__(self, idx):
        record = self.data_list[idx]
        img_id = record['image_id']
        bboxs = record['bboxs']
        labels = record['labels']

        img = Image.open(image_f.format(self.mode, img_id)).convert('RGB') #.resize((800,800))
        img = np.array(img)

        if self.transforms:
            # apply the albumentations pipeline once (it already chains every transform)
            sample = self.transforms(image=img, bboxes=bboxs, category_ids=labels)
            image = sample['image']
            bboxs = np.asarray(sample['bboxes'])
            labels = np.asarray(sample['category_ids'])
        else:
            image = img

        if self.mode=='train':
            target = encode(bboxs, labels)
            return image, target
        else:
            return image

 

Define the torch Dataset.

 

# utils
def encode(bboxs, labels):
    # Make YoLo Target

    S = feature_size
    B = num_bboxes
    N = 5 * B + num_classes
    cell_size = 1.0 / float(S)
    # print(bboxs.shape)

    box_cxy = (bboxs[:, 2:] + bboxs[:, :2])/2.0
    box_wh = bboxs[:, 2:] - bboxs[:, :2]
    target = np.zeros((S, S, N))
    for b in range(bboxs.shape[0]):
        cxy, wh, label = box_cxy[b], box_wh[b], labels[b]
        ij = np.ceil(cxy / cell_size) - 1.0
        i, j = map(int, ij)
        top_left = ij*cell_size
        dxy_norm = (cxy-top_left)/cell_size
        for k in range(B):
            target[j, i, 5*k: 5*(k+1)] = np.r_[dxy_norm, wh, 1]
        target[j, i, 5*B+label] = 1.0
    return target

This converts the original bboxs into the YOLO target.

 

 

# Transformer 재정의
def get_train_transforms():
    return A.Compose([
        A.Resize(448,448, always_apply=True, p=1),
        # A.Cutout(num_holes=7, max_h_size=16, max_w_size=16, fill_value=0, always_apply=False, p=0.5),
        A.RandomBrightnessContrast(p=0.2),
        A.HorizontalFlip(),
        ToTensor(),
    ], bbox_params=A.BboxParams(format='albumentations', label_fields=['category_ids']))

# bbox_params is included so the validation set, which also passes bboxes, is handled consistently
def get_test_transforms():
    return A.Compose([
        A.Resize(448,448, always_apply=True, p=1),
        ToTensor(),
    ], bbox_params=A.BboxParams(format='albumentations', label_fields=['category_ids']))

Define the transforms with albumentations.

Cutout caused the loss to become NaN during training, so I took it out.

 

train_ds = VOCDataset(train_list, transforms=get_train_transforms())
valid_ds = VOCDataset(valid_list, transforms=get_test_transforms())
test_ds = VOCDataset(test_list, mode='test', transforms=get_test_transforms())

# stack the individual tensors into one batch tensor
def collate_fn(batch):
    images, targets = zip(*batch)
    return torch.cat([img.reshape(-1, 3, 448, 448) for img in images], 0), torch.FloatTensor(targets)

def test_collate_fn(batch):
    images = batch
    return torch.cat([img.reshape(-1, 3, 448, 448) for img in images], 0)

train_loader = DataLoader(train_ds, batch_size=32, shuffle=True, collate_fn=collate_fn)
valid_loader = DataLoader(valid_ds, batch_size=32, shuffle=False, collate_fn=collate_fn)
test_loader = DataLoader(test_ds, batch_size=1, shuffle=False, collate_fn=test_collate_fn)

Wrap them in DataLoaders.
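
A quick sanity check of the loader output (assuming the datasets above were built; shapes correspond to batch_size=32, B=2, 20 classes):

images, targets = next(iter(train_loader))
print(images.shape, targets.shape)  # torch.Size([32, 3, 448, 448]) torch.Size([32, 7, 7, 30])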

 

# YoLo
class YoLo_v1(nn.Module):
    def __init__(self, num_classes=20, num_bboxes=2):
        super(YoLo_v1, self).__init__()

        self.feature_size = 7
        self.num_bboxes=num_bboxes
        self.num_classes=num_classes
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, stride=2, padding=4),
            # nn.BatchNorm2d(64),
            nn.LeakyReLU(0.1, inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=64, out_channels=192, kernel_size=3, stride=1, padding=1),
            # nn.BatchNorm2d(192),
            nn.LeakyReLU(0.1, inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=192, out_channels=128, kernel_size=1, stride=1, padding=0),
            # nn.BatchNorm2d(128),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, stride=1, padding=1),
            # nn.BatchNorm2d(256),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=1, stride=1, padding=0),
            # nn.BatchNorm2d(256),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=1),
            # nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1, inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=0),
            # nn.BatchNorm2d(256),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=1),
            # nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=0),
            # nn.BatchNorm2d(256),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=1),
            # nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=0),
            # nn.BatchNorm2d(256),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=1),
            # nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=0),
            # nn.BatchNorm2d(256),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=1),
            # nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=1, stride=1, padding=0),
            # nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=1),
            # nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1, inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            nn.Conv2d(in_channels=1024, out_channels=512, kernel_size=1, stride=1, padding=0),
            # nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=1),
            # nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=1024, out_channels=512, kernel_size=1, stride=1, padding=0),
            # nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=1),
            # nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=1),
            # nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=2, padding=1),
            # nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1, inplace=True),

            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=1),
            # nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=1),
            # nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1, inplace=True),
        )

        self.fc = nn.Sequential(
            Flatten(),
            nn.Linear(in_features=7*7*1024, out_features=4096),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(in_features=4096, out_features=(feature_size*feature_size*(5 * num_bboxes + num_classes))),
            nn.Sigmoid()  # keep outputs in [0, 1]; the paper's final layer is linear, but sigmoid avoids negative w/h under the sqrt in the loss
        )
            
        self.init_weight(self.conv)
        self.init_weight(self.fc)

    def forward(self, x):
        s, b, c = self.feature_size, self.num_bboxes, self.num_classes

        x = self.conv(x)
        x = self.fc(x)

        x = x.view(-1, s, s, (5 * b + c))
        return x

    def init_weight(self, modules):
        for m in modules:
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='leaky_relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.constant_(m.bias, 0)

class Squeeze(nn.Module):
    def __init__(self):
        super(Squeeze, self).__init__()

    def forward(self, x):
        return x.squeeze()

class Flatten(nn.Module):
    def __init__(self):
        super(Flatten, self).__init__()

    def forward(self, x):
        return x.view(x.size(0), -1)

 

This is the modeling part. The network structure follows the paper (here with a sigmoid on the output instead of the paper's linear final layer).
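
A minimal shape check with a dummy input (on CPU, just to confirm the 7x7x30 output; the name model is only for this check):

model = YoLo_v1()
dummy = torch.randn(1, 3, 448, 448)
print(model(dummy).shape)  # torch.Size([1, 7, 7, 30])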

 

def compute_iou(bbox1, bbox2):
    """ Compute the IoU (Intersection over Union) of two set of bboxes, each bbox format: [x1, y1, x2, y2].
    Args:
        bbox1: (Tensor) bounding bboxes, sized [N, 4].
        bbox2: (Tensor) bounding bboxes, sized [M, 4].
    Returns:
        (Tensor) IoU, sized [N, M].
    """
    N = bbox1.size(0)
    M = bbox2.size(0)

    # Compute left-top coordinate of the intersections
    lt = torch.max(
        bbox1[:, :2].unsqueeze(1).expand(N, M, 2), # [N, 2] -> [N, 1, 2] -> [N, M, 2]
        bbox2[:, :2].unsqueeze(0).expand(N, M, 2)  # [M, 2] -> [1, M, 2] -> [N, M, 2]
    )
    # Conpute right-bottom coordinate of the intersections
    rb = torch.min(
        bbox1[:, 2:].unsqueeze(1).expand(N, M, 2), # [N, 2] -> [N, 1, 2] -> [N, M, 2]
        bbox2[:, 2:].unsqueeze(0).expand(N, M, 2)  # [M, 2] -> [1, M, 2] -> [N, M, 2]
    )
    # Compute area of the intersections from the coordinates
    wh = rb - lt   # width and height of the intersection, [N, M, 2]
    wh[wh < 0] = 0 # clip at 0
    inter = wh[:, :, 0] * wh[:, :, 1] # [N, M]

    # Compute area of the bboxes
    area1 = (bbox1[:, 2] - bbox1[:, 0]) * (bbox1[:, 3] - bbox1[:, 1]) # [N, ]
    area2 = (bbox2[:, 2] - bbox2[:, 0]) * (bbox2[:, 3] - bbox2[:, 1]) # [M, ]
    area1 = area1.unsqueeze(1).expand_as(inter) # [N, ] -> [N, 1] -> [N, M]
    area2 = area2.unsqueeze(0).expand_as(inter) # [M, ] -> [1, M] -> [N, M]

    # Compute IoU from the areas
    union = area1 + area2 - inter # [N, M]
    iou = inter / union           # [N, M]

    return iou

This is the IoU computation. Given two sets of boxes, the IoU code itself is simple;

the only real difference is whether you compute it with torch or numpy.
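
A tiny check with two half-overlapping unit boxes (expected IoU = 0.5 / 1.5 = 1/3; the inputs are my own toy values):

box_a = torch.tensor([[0.0, 0.0, 1.0, 1.0]])
box_b = torch.tensor([[0.5, 0.0, 1.5, 1.0]])
print(compute_iou(box_a, box_b))  # tensor([[0.3333]])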

 

def loss_fn(pred_tensor, target_tensor):
    """ Compute loss for YOLO training.
    Args:
        pred_tensor: (Tensor) predictions, sized [n_batch, S, S, Bx5+C], 5=len([x, y, w, h, conf]).
        target_tensor: (Tensor) targets, sized [n_batch, S, S, Bx5+C].
    Returns:
        (Tensor): loss, sized [1, ].
    """
    # TODO: Romove redundant dimensions for some Tensors.

    S = feature_size
    B = num_bboxes
    C = num_classes
    N = 5 * B + C 
    lambda_coord = 5.0
    lambda_noobj = 0.5


    batch_size = pred_tensor.size(0)
    coord_mask = target_tensor[:, :, :, 4] > 0  # mask for the cells which contain objects. [n_batch, S, S]
    noobj_mask = target_tensor[:, :, :, 4] == 0 # mask for the cells which do not contain objects. [n_batch, S, S]
    coord_mask = coord_mask.unsqueeze(-1).expand_as(target_tensor) # [n_batch, S, S] -> [n_batch, S, S, N]
    noobj_mask = noobj_mask.unsqueeze(-1).expand_as(target_tensor) # [n_batch, S, S] -> [n_batch, S, S, N]

    coord_pred = pred_tensor[coord_mask].view(-1, N)            # pred tensor on the cells which contain objects. [n_coord, N]
                                                                # n_coord: number of the cells which contain objects.
    bbox_pred = coord_pred[:, :5*B].contiguous().view(-1, 5)    # [n_coord x B, 5=len([x, y, w, h, conf])]
    class_pred = coord_pred[:, 5*B:]                            # [n_coord, C]

    coord_target = target_tensor[coord_mask].view(-1, N)        # target tensor on the cells which contain objects. [n_coord, N]
                                                                # n_coord: number of the cells which contain objects.
    bbox_target = coord_target[:, :5*B].contiguous().view(-1, 5)# [n_coord x B, 5=len([x, y, w, h, conf])]
    class_target = coord_target[:, 5*B:]                        # [n_coord, C]

    # Compute loss for the cells with no object bbox.
    noobj_pred = pred_tensor[noobj_mask].view(-1, N)        # pred tensor on the cells which do not contain objects. [n_noobj, N]
                                                            # n_noobj: number of the cells which do not contain objects.
    noobj_target = target_tensor[noobj_mask].view(-1, N)    # target tensor on the cells which do not contain objects. [n_noobj, N]
                                                            # n_noobj: number of the cells which do not contain objects.
    noobj_conf_mask = torch.cuda.ByteTensor(noobj_pred.size()).fill_(0) # [n_noobj, N]
    for b in range(B):
        noobj_conf_mask[:, 4 + b*5] = 1 # noobj_conf_mask[:, 4] = 1; noobj_conf_mask[:, 9] = 1
    noobj_pred_conf = noobj_pred[noobj_conf_mask]       # [n_noobj, 2=len([conf1, conf2])]
    noobj_target_conf = noobj_target[noobj_conf_mask]   # [n_noobj, 2=len([conf1, conf2])]
    loss_noobj = F.mse_loss(noobj_pred_conf, noobj_target_conf, reduction='sum')

    # Compute loss for the cells with objects.
    coord_response_mask = torch.cuda.ByteTensor(bbox_target.size()).fill_(0)    # [n_coord x B, 5]
    coord_not_response_mask = torch.cuda.ByteTensor(bbox_target.size()).fill_(1)# [n_coord x B, 5]
    bbox_target_iou = torch.zeros(bbox_target.size()).cuda()                    # [n_coord x B, 5], only the last 1=(conf,) is used

    # Choose the predicted bbox having the highest IoU for each target bbox.
    for i in range(0, bbox_target.size(0), B):
        pred = bbox_pred[i:i+B] # predicted bboxes at i-th cell, [B, 5=len([x, y, w, h, conf])]
        pred_xyxy = Variable(torch.FloatTensor(pred.size())) # [B, 5=len([x1, y1, x2, y2, conf])]
        # Because (center_x,center_y)=pred[:, 2] and (w,h)=pred[:,2:4] are normalized for cell-size and image-size respectively,
        # rescale (center_x,center_y) for the image-size to compute IoU correctly.
        pred_xyxy[:,  :2] = pred[:, :2]/float(S) - 0.5 * pred[:, 2:4]
        pred_xyxy[:, 2:4] = pred[:, :2]/float(S) + 0.5 * pred[:, 2:4]

        target = bbox_target[i] # target bbox at i-th cell. Because target boxes contained by each cell are identical in current implementation, enough to extract the first one.
        target = bbox_target[i].view(-1, 5) # target bbox at i-th cell, [1, 5=len([x, y, w, h, conf])]
        target_xyxy = Variable(torch.FloatTensor(target.size())) # [1, 5=len([x1, y1, x2, y2, conf])]
        # Because (center_x,center_y)=target[:, 2] and (w,h)=target[:,2:4] are normalized for cell-size and image-size respectively,
        # rescale (center_x,center_y) for the image-size to compute IoU correctly.
        target_xyxy[:,  :2] = target[:, :2]/float(S) - 0.5 * target[:, 2:4]
        target_xyxy[:, 2:4] = target[:, :2]/float(S) + 0.5 * target[:, 2:4]

        iou = compute_iou(pred_xyxy[:, :4], target_xyxy[:, :4]) # [B, 1]
        max_iou, max_index = iou.max(0)
        max_index = max_index.data.cuda()

        coord_response_mask[i+max_index] = 1
        coord_not_response_mask[i+max_index] = 0

        # "we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth"
        # from the original paper of YOLO.
        bbox_target_iou[i+max_index, torch.LongTensor([4]).cuda()] = (max_iou).data.cuda()
    bbox_target_iou = Variable(bbox_target_iou).cuda()

    # BBox location/size and objectness loss for the response bboxes.
    bbox_pred_response = bbox_pred[coord_response_mask].view(-1, 5)      # [n_response, 5]
    bbox_target_response = bbox_target[coord_response_mask].view(-1, 5)  # [n_response, 5], only the first 4=(x, y, w, h) are used
    target_iou = bbox_target_iou[coord_response_mask].view(-1, 5)        # [n_response, 5], only the last 1=(conf,) is used
    loss_xy = F.mse_loss(bbox_pred_response[:, :2], bbox_target_response[:, :2], reduction='sum')
    loss_wh = F.mse_loss(torch.sqrt(bbox_pred_response[:, 2:4]), torch.sqrt(bbox_target_response[:, 2:4]), reduction='sum')
    loss_obj = F.mse_loss(bbox_pred_response[:, 4], target_iou[:, 4], reduction='sum')

    # Class probability loss for the cells which contain objects.
    loss_class = F.mse_loss(class_pred, class_target, reduction='sum')

    # Total loss
    loss = lambda_coord * (loss_xy + loss_wh) + loss_obj + lambda_noobj * loss_noobj + loss_class
    loss = loss / float(batch_size)

    return loss

This is the loss function.

Masking separates cells with and without objects; for each object, the predictor with the highest IoU is selected and compared against the target.

The rest follows the multi-part loss formula from the paper.

 

yolo = YoLo_v1().cuda()
# criterion = loss_fn().cuda()
init_lr = 0.001
base_lr = 0.01
optimizer = optim.SGD(yolo.parameters(), lr=init_lr, momentum=0.9, weight_decay=0.0005)

def update_lr(optimizer, epoch, step_per, burnin_exp=4.0):
    if epoch in range(50):
        lr = init_lr + (base_lr - init_lr) * math.pow(epoch/(50-1), burnin_exp)
    elif epoch == 50:
        lr = base_lr
    elif epoch == 750:
        lr = 0.001
    elif epoch == 1050:
        lr = 0.0001
    else: return

    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

Assuming we are training from scratch (no pretrained backbone),

I run roughly 10x the paper's number of epochs, so the schedule differs a bit from the paper's setup.
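
To see what the warm-up and the decays actually produce, here is a quick print-out with a throwaway optimizer (dummy_opt is hypothetical, only used to inspect the schedule):

dummy_opt = optim.SGD([torch.zeros(1, requires_grad=True)], lr=init_lr)
for e in [0, 10, 25, 49, 50, 750, 1050]:
    update_lr(dummy_opt, e, 0.0)
    print(e, dummy_opt.param_groups[0]['lr'])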

 

start_time = time.time()
bl = len(train_loader)
history = {'total_loss':[]}
for epoch in range(1500):  # loop over the dataset multiple times
    tk0 = tqdm(train_loader, total=bl,leave=False)
    t_loss = 0.0
    breaking=False
    for step, (image, target) in enumerate(tk0, 0):
        image, target = image.to(device), target.to(device)
        update_lr(optimizer, epoch, float(step) / float(bl - 1))
        output = yolo(image)
        loss = loss_fn(output, target).cuda()

        if math.isnan(loss):
            print(loss)
            breaking = True
            break
        # zero the parameter gradients
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        t_loss += loss.item()

        history['total_loss'].append(loss.item())

    if breaking:
        break

    # print statistics
    tqdm.write('[Epoch : %d] total_loss: %.5f Total_elapsed_time: %d min' %
        (epoch + 1, t_loss / bl, (time.time()-start_time)/60))

print(time.time()-start_time)
print('Finished Training')

This is the training loop.

The loss sometimes became NaN (when cutout was used), so the NaN check and break can be removed if you do not run into it.

 

 

def decode(pred_tensor):
    """ Decode tensor into box coordinates, class labels, and probs_detected.
    Args:
        pred_tensor: (tensor) tensor to decode sized [S, S, 5 x B + C], 5=(x, y, w, h, conf)
    Returns:
        boxes: (tensor) [[x1, y1, x2, y2]_obj1, ...]. Normalized from 0.0 to 1.0 w.r.t. image width/height, sized [n_boxes, 4].
        labels: (tensor) class labels for each detected boxe, sized [n_boxes,].
        confidences: (tensor) objectness confidences for each detected box, sized [n_boxes,].
        class_scores: (tensor) scores for most likely class for each detected box, sized [n_boxes,].
    """
    S, B, C = feature_size, num_bboxes, num_classes
    conf_thresh=0.1
    prob_thresh=0.1
    nms_thresh=0.5

    boxes, labels, confidences, class_scores = [], [], [], []

    cell_size = 1.0 / float(S)

    conf = pred_tensor[:, :, 4].unsqueeze(2) # [S, S, 1]
    for b in range(1, B):
        conf = torch.cat((conf, pred_tensor[:, :, 5*b + 4].unsqueeze(2)), 2)
    conf_mask = conf > conf_thresh # [S, S, B]

    # TBM, further optimization may be possible by replacing the following for-loops with tensor operations.
    for i in range(S): # for x-dimension.
        for j in range(S): # for y-dimension.
            class_score, class_label = torch.max(pred_tensor[j, i, 5*B:], 0)

            for b in range(B):
                conf = pred_tensor[j, i, 5*b + 4]
                prob = conf * class_score
                if float(prob) < prob_thresh:
                    continue

                # Compute box corner (x1, y1, x2, y2) from tensor.
                box = pred_tensor[j, i, 5*b : 5*b + 4]
                x0y0_normalized = torch.FloatTensor([i, j]) * cell_size # cell left-top corner. Normalized from 0.0 to 1.0 w.r.t. image width/height.
                xy_normalized = box[:2] * cell_size + x0y0_normalized   # box center. Normalized from 0.0 to 1.0 w.r.t. image width/height.
                wh_normalized = box[2:] # Box width and height. Normalized from 0.0 to 1.0 w.r.t. image width/height.
                box_xyxy = torch.FloatTensor(4) # [4,]
                box_xyxy[:2] = xy_normalized - 0.5 * wh_normalized # left-top corner (x1, y1).
                box_xyxy[2:] = xy_normalized + 0.5 * wh_normalized # right-bottom corner (x2, y2).

                # Append result to the lists.
                boxes.append(box_xyxy)
                labels.append(class_label)
                confidences.append(conf)
                class_scores.append(class_score)

    if len(boxes) > 0:
        boxes = torch.stack(boxes, 0) # [n_boxes, 4]
        labels = torch.stack(labels, 0)             # [n_boxes, ]
        confidences = torch.stack(confidences, 0)   # [n_boxes, ]
        class_scores = torch.stack(class_scores, 0) # [n_boxes, ]
    else:
        # If no box found, return empty tensors.
        boxes = torch.FloatTensor(0, 4)
        labels = torch.LongTensor(0)
        confidences = torch.FloatTensor(0)
        class_scores = torch.FloatTensor(0)

    return boxes, labels, confidences, class_scores

This code converts the network output back into boxes, labels, objectness confidences, and class probabilities.
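
decode() defines nms_thresh but never applies NMS itself. A minimal greedy NMS sketch that could be run on its outputs (my own addition, not from the original code; nms() and its arguments are assumed names):

def nms(boxes, scores, nms_thresh=0.5):
    # boxes: [n, 4] as (x1, y1, x2, y2), scores: [n]; returns indices of the kept boxes
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        best = order[0]
        keep.append(best.item())
        if order.numel() == 1:
            break
        # drop every remaining box whose IoU with the best box exceeds the threshold
        ious = compute_iou(boxes[best].unsqueeze(0), boxes[order[1:]]).squeeze(0)
        order = order[1:][ious <= nms_thresh]
    return torch.LongTensor(keep)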

 

These outputs are then used for visualization.

def test_visualize(images, outputs):
    # i is assumed to be set (globally) to the index of the current test image
    fig,ax = plt.subplots(1, figsize=(7,7))
    img = Image.open(image_f.format('test', test_list[i]['image_id']))
    w, h = test_list[i]['image_size']
    im = np.asarray(img)

    ax.imshow(im)

    for output in outputs:
        boxes, labels, confs, scores = decode(output)
        for b, l, sc in zip(boxes, labels, scores):
            rect = patches.Rectangle((b[0]*w,b[1]*h),(b[2]-b[0])*w,(b[3]-b[1])*h,linewidth=1,edgecolor='r',facecolor='none')
            ax.add_patch(rect)
            props = dict(boxstyle='round', facecolor='red', alpha=.9)
            plt.text(b[0]*w, b[1]*h, '%s : %.2f'%(classes[int(l)], sc), fontsize=10, color='white', bbox=props)
    plt.show()

Because the function looks up the test info by index,

the variable i has to match the order of the test image being shown.

It displays not only the class but also the predicted probability for that class.
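
A rough usage sketch of how I would run it over a few test images (the loop names are my own; the output is moved to the CPU because decode() builds CPU tensors):

yolo.eval()
with torch.no_grad():
    for i, image in enumerate(test_loader):
        output = yolo(image.to(device)).cpu()
        test_visualize(image, output)
        if i == 4:  # only look at the first few test images
            break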

 

 

# Conclusion

I will post the results once the 1500 epochs finish running on Colab...

 

 


Ref.

YoLo v1 Article

YoLo v1 Pytorch Github

YoLo Images

 

 
