Hello! Soshin here.
For R-CNN and Fast R-CNN I mostly just reviewed the papers without implementing them,
but for Faster R-CNN I implemented it as well. (There do seem to be some bugs, though...)
# Faster R-CNN Architecture
Faster R-CNN is essentially Fast R-CNN with a Region Proposal Network (RPN) added on top.
The weakness of Fast R-CNN
Selective Search runs on the CPU (the bottleneck in computation time)
Faster R-CNN's solution
Handle region proposals inside the model itself
This is what makes it a truly end-to-end trainable model,
but the training procedure turned out to be trickier than I expected.
# Faster R-CNN Training Process
A conv layer, commonly called the backbone, extracts a feature map.
The feature map is fed into the RPN layer, which generates region proposals.
Bounding box regression and classification are then performed on the feature map using the region proposals obtained from the RPN.
RoI Pooling + RoI Head (Fast R-CNN)
Training is driven by two losses: the RPN loss and the RoI (Fast R-CNN head) loss.
These are summed into a single total loss, which updates the entire network.
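Before diving into the from-scratch code, here is a quick illustration of that two-loss structure using torchvision's reference implementation (this is not part of the original walkthrough, and the boxes and labels are arbitrary placeholders):
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=21)
model.train()
images = [torch.rand(3, 800, 800)]
targets = [{"boxes": torch.tensor([[30., 20., 500., 400.]]), "labels": torch.tensor([6])}]
loss_dict = model(images, targets)  # loss_objectness / loss_rpn_box_reg (RPN) + loss_classifier / loss_box_reg (RoI head)
total_loss = sum(loss_dict.values())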
# Understanding Faster R-CNN at the Code Level
Understanding Faster R-CNN at the code level took me about three full days. (Maybe I'm just slow...)
The code I referenced:
1. Faster RCNN from scratch Github
2. Ganghee-Lee/Faster-RCNN-TensorFlow Github
3. Simple Faster RCNN pytorch Github
Three repositories in total.
The walkthrough below follows the first one.
# Package Imports and Custom utils
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from utils import *
utils contains helpers for IoU computation, non-maximum suppression (NMS), bounding box -> loc conversion, and loc -> bounding box conversion (the full utils.py is listed at the end of this post).
# Sample Data, Feature Extraction
image = torch.zeros((1, 3, 800, 800)).float()
image_size = (800, 800)
# bbox -> y1, x1, y2, x2
bbox = torch.FloatTensor([[20, 30, 400, 500], [300, 400, 500, 600]])
labels = torch.LongTensor([6, 8])
sub_sample = 16
vgg16 = torchvision.models.vgg16(pretrained=True)
req_features = vgg16.features[:30]
print(req_features)
output_map = req_features(image)
print(output_map.shape)
The sample data is an 800x800 image with two bounding boxes and two labels.
sub_sample is the total downsampling factor: the feature map shrinks from 800 to 50, i.e. four halvings (2**4), so it is 16.
Here are the original image and its boxes.
output_map is the result of passing the image through the conv layers (VGG16 here).
# Anchor Generation
anchor_scale = [8, 16, 32]
ratio = [0.5, 1, 2] # H/W
len_anchor_scale = len(anchor_scale)
len_ratio = len(ratio)
len_anchor_template = len_anchor_scale * len_ratio
anchor_template = np.zeros((9, 4))
for idx, scale in enumerate(anchor_scale):
h = scale * np.sqrt(ratio) * sub_sample
w = scale / np.sqrt(ratio) * sub_sample
y1 = -h/2
x1 = -w/2
y2 = h/2
x2 = w/2
anchor_template[idx*len_ratio:(idx+1)*len_ratio, 0] = y1
anchor_template[idx*len_ratio:(idx+1)*len_ratio, 1] = x1
anchor_template[idx*len_ratio:(idx+1)*len_ratio, 2] = y2
anchor_template[idx*len_ratio:(idx+1)*len_ratio, 3] = x2
print(anchor_template)
The anchor scales and ratios define the box shapes to evaluate (anchor_template).
Each anchor position gets 3 scales x 3 ratios = 9 templates.
Here is a diagram of the templates: 9 in total across sizes and aspect ratios.
feature_map_size = (50, 50)
# The first center coords are (8, 8)
ctr_y = np.arange(8, 800, 16)
ctr_x = np.arange(8, 800, 16)
ctr = np.zeros((*feature_map_size, 2))
for idx, y in enumerate(ctr_y):
ctr[idx, :, 0] = y
ctr[idx, :, 1] = ctr_x
print(ctr.shape)
These are the points where the anchor_template will be placed.
Plotting them gives the 50x50 grid of anchor centers shown below.
anchors = np.zeros((*feature_map_size, 9, 4))
for idx_y in range(feature_map_size[0]):
for idx_x in range(feature_map_size[1]):
anchors[idx_y, idx_x] = (ctr[idx_y, idx_x] + anchor_template.reshape(-1, 2, 2)).reshape(-1, 4)
anchors = anchors.reshape(-1, 4)
print(anchors.shape) # (22500, 4)
The anchor template is then applied at each of these centers.
The blue box is the 800x800 image,
and applying the template at every center produces the 22,500 (50x50x9) anchor boxes shown.
# anchor box labeling for RPN
valid_index = np.where((anchors[:, 0] >= 0)
&(anchors[:, 1] >= 0)
&(anchors[:, 2] <= 800)
&(anchors[:, 3] <= 800))[0]
print(valid_index.shape) # 8940
Boxes that extend beyond the image can't really be used, so we exclude them.
valid_labels = np.empty((valid_index.shape[0],), dtype=np.int32)
valid_labels.fill(-1)
valid_anchors = anchors[valid_index]
print(valid_anchors.shape) # (8940,4)
print(bbox.shape) # torch.Size([2,4])
The RPN doesn't care about the class; it only needs to propose regions that contain an object, so its labels are just 1 and 0.
Also, some of the 8,940 valid anchors overlap and some are unnecessary,
so every label is initialized to -1, meaning "ignore".
ious = bbox_iou(valid_anchors, bbox.numpy()) # anchor 8940 : bbox 2
pos_iou_thres = 0.7
neg_iou_thres = 0.3
# Scenario A
anchor_max_iou = np.amax(ious, axis=1)
pos_iou_anchor_label = np.where(anchor_max_iou >= pos_iou_thres)[0]
neg_iou_anchor_label = np.where(anchor_max_iou < neg_iou_thres)[0]
valid_labels[pos_iou_anchor_label] = 1
valid_labels[neg_iou_anchor_label] = 0
# Scenario B
gt_max_iou = np.amax(ious, axis=0)
gt_max_iou_anchor_label = np.where(ious == gt_max_iou)[0]
print(gt_max_iou_anchor_label)
valid_labels[gt_max_iou_anchor_label] = 1
We compute the IoU between the valid anchors and the ground-truth boxes.
ious is a matrix whose rows are anchors and whose columns are the ground-truth boxes:
its shape is (8940, 2), holding each anchor's IoU with each box.
Anchors with IoU >= 0.7 are labeled positive and those below 0.3 negative.
Since few anchors usually exceed 0.7, Scenario A applies the thresholds from the paper,
and Scenario B additionally labels, for each ground-truth box, the anchor(s) with the highest IoU as positive.
n_sample_anchors = 256
pos_ratio = 0.5
total_n_pos = len(np.where(valid_labels == 1)[0])
n_pos_sample = int(n_sample_anchors*pos_ratio) if total_n_pos > n_sample_anchors*pos_ratio else total_n_pos
n_neg_sample = n_sample_anchors - n_pos_sample
pos_index = np.where(valid_labels == 1)[0]
if len(pos_index) > n_sample_anchors*pos_ratio:
disable_index = np.random.choice(pos_index, size=len(pos_index)-n_pos_sample, replace=False)
valid_labels[disable_index] = -1
neg_index = np.where(valid_labels == 0)[0]
disable_index = np.random.choice(neg_index, size=len(neg_index) - n_neg_sample, replace=False)
valid_labels[disable_index] = -1
Then only 256 anchors are kept across positives and negatives; the rest are disabled (-1).
If there are fewer than 128 positives, the remainder is filled with negatives.
# Each anchor corresponds to a box
argmax_iou = np.argmax(ious, axis=1)
max_iou_box = bbox[argmax_iou].numpy()
print(max_iou_box.shape) # 8940, 4
print(valid_anchors.shape) # 8940, 4
anchor_loc_format_target = format_loc(valid_anchors, max_iou_box)
print(anchor_loc_format_target.shape) # 8940, 4
The code above uses ious to find, for each anchor, which ground-truth box has the higher IoU:
(0.37312, 0.38272) gives index 1, and (0.38272, 0.37312) gives index 0.
This yields an array of 8,940 indices like 1 0 1 0 0 0 0 1 0, ...
Indexing the boxes with it assigns a box to every anchor, giving an (8940, 4) array.
Then the format_loc function from utils (hand-written) converts each anchor and its assigned box into regression targets,
i.e. the offsets the box regressor should learn to predict.
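Concretely, format_loc computes the standard box-regression parameterization (the same formulas appear as comments in utils.py at the end of this post):
t_x = (x - x_a) / w_a
t_y = (y - y_a) / h_a
t_w = log(w / w_a)
t_h = log(h / h_a)
where (x, y, w, h) are the center and size of the assigned ground-truth box and (x_a, y_a, w_a, h_a) those of the anchor.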
anchor_target_labels = np.empty((len(anchors),), dtype=np.int32)
anchor_target_format_locations = np.zeros((len(anchors), 4), dtype=np.float32)
anchor_target_labels.fill(-1)
anchor_target_labels[valid_index] = valid_labels
anchor_target_format_locations[valid_index] = anchor_loc_format_target
print(anchor_target_labels.shape) # 22500,
print(anchor_target_format_locations.shape) # 22500, 4
This gives the final set of anchors with labels and regression targets assigned, expanded back to the full 22,500 anchors.
# RPN
Everything above can be seen as preparation for the RPN.
mid_channel = 512
in_channel = 512
n_anchor = 9
conv1 = nn.Conv2d(in_channel, mid_channel, 3, 1, 1)
reg_layer = nn.Conv2d(mid_channel, n_anchor*4, 1, 1, 0)
cls_layer = nn.Conv2d(mid_channel, n_anchor*2, 1, 1, 0)
With a VGG backbone the conv layers output 512 channels, so in_channel is 512.
Box regression outputs 9 anchors * 4 (location values),
and box classification outputs 9 anchors * 2 (object or not).
x = conv1(output_map)
anchor_pred_format_locations = reg_layer(x)
anchor_pred_scores = cls_layer(x)
print(anchor_pred_format_locations.shape) # torch.Size([1, 36, 50, 50])
print(anchor_pred_scores.shape) # torch.Size([1, 18, 50, 50])
After weight initialization, the extracted feature map is passed through the conv layer,
and locations and classes are predicted from it.
This gives regression and classification predictions at every position of the (50, 50) feature map.
anchor_pred_format_locations = anchor_pred_format_locations.permute(0, 2, 3, 1).contiguous().view(1, -1, 4)
anchor_pred_scores = anchor_pred_scores.permute(0, 2, 3, 1).contiguous().view(1, -1, 2)
objectness_pred_scores = anchor_pred_scores[:, :, 1]
To compare them with the ground-truth anchors built above, we reshape them to match.
print(anchor_target_labels.shape)
print(anchor_target_format_locations.shape)
print(anchor_pred_scores.shape)
print(anchor_pred_format_locations.shape)
gt_rpn_format_locs = torch.from_numpy(anchor_target_format_locations)
gt_rpn_scores = torch.from_numpy(anchor_target_labels)
rpn_format_locs = anchor_pred_format_locations[0]
rpn_scores = anchor_pred_scores[0]
target holds the ground truth built from the bboxes; pred holds the RPN predictions.
Both regression tensors are (22500, 4); the target labels are (22500,) and the predicted scores are (22500, 2).
The targets are converted from numpy to torch, and we take the single element of the batch (the batch size is 1 here).
####### Object or not loss
rpn_cls_loss = F.cross_entropy(rpn_scores, gt_rpn_scores.long(), ignore_index=-1)
print(rpn_cls_loss)
####### location loss
mask = gt_rpn_scores > 0
mask_target_format_locs = gt_rpn_format_locs[mask]
mask_pred_format_locs = rpn_format_locs[mask]
print(mask_target_format_locs.shape)
print(mask_pred_format_locs.shape)
x = torch.abs(mask_target_format_locs - mask_pred_format_locs)
rpn_loc_loss = ((x<0.5).float()*(x**2)*0.5 + (x>0.5).float()*(x-0.5)).sum()
print(rpn_loc_loss)
The objectness (object or not) is trained with cross-entropy loss; for the locations, only anchors labeled as actual objects are masked in and used to compute the loss.
rpn_lambda = 10
N_reg = mask.float().sum()
rpn_loss = rpn_cls_loss + rpn_lambda / N_reg * rpn_loc_loss
print(rpn_loss)
The cls loss and loc loss are then combined with a lambda weight.
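Written out, this matches the RPN loss from the paper, with lambda = 10 and N_reg taken here as the number of positive anchors in the mask:
L_RPN = L_cls + (lambda / N_reg) * sum_i smoothL1(t_i, t_i*)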
# Generating Proposal to Feed Fast R-CNN
This step keeps only those proposals from the RPN that will actually be fed to Fast R-CNN.
nms_thresh = 0.7
n_train_pre_nms = 12000
n_train_post_nms = 2000
n_test_pre_nms = 6000
n_test_post_nms = 300
min_size = 16
Non-maximum suppression (NMS) compares the IoU between boxes, ranked by objectness score, and removes the overlapping duplicates. The threshold here is nms_thresh = 0.7.
Before NMS only the top 12,000 proposals are kept,
and after NMS 2,000 final proposals remain.
These 2,000 proposals are what Fast R-CNN is trained on.
Proposals whose width or height is smaller than 16 are also discarded.
At test time 6,000 and 300 are used instead (not used in this code).
print(anchors.shape) # 22500, 4
print(anchor_pred_format_locations.shape) # 22500, 4
rois = deformat_loc(anchors=anchors, formatted_base_anchor=anchor_pred_format_locations[0].data.numpy())
print(rois.shape) # 22500, 4
print(rois)
#[[ -37.56205856 -83.65124834 55.51502551 96.9647187 ]
# [ -59.50866938 -56.68875009 64.91222143 72.23375052]
# [ -81.40298363 -41.99777969 96.39533509 49.35743635]
# ...
# [ 610.35422226 414.3952291 979.0893042 1163.98340092]
# [ 538.20066833 564.81064224 1041.29725647 1063.15491104]
# [ 432.48094419 606.7697889 1166.24708388 973.39356325]]
This part is a bit confusing: the predicted offsets are decoded back into RoIs (bounding boxes) using the anchors.
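To make it concrete, deformat_loc (listed in utils.py at the end) simply inverts the parameterization used earlier:
x = t_x * w_a + x_a
y = t_y * h_a + y_a
w = w_a * exp(t_w)
h = h_a * exp(t_h)
and the corners are then recovered as (x - w/2, y - h/2, x + w/2, y + h/2).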
rois[:, 0:4:2] = np.clip(rois[:, 0:4:2], a_min=0, a_max=image_size[0])
rois[:, 1:4:2] = np.clip(rois[:, 1:4:2], a_min=0, a_max=image_size[1])
print(rois)
# [[ 0. 0. 55.51502551 96.9647187 ]
# [ 0. 0. 64.91222143 72.23375052]
# [ 0. 0. 96.39533509 49.35743635]
# ...
# [610.35422226 414.3952291 800. 800. ]
# [538.20066833 564.81064224 800. 800. ]
# [432.48094419 606.7697889 800. 800. ]]
Values that fall outside the image are then clipped to the image boundaries.
h = rois[:, 2] - rois[:, 0]
w = rois[:, 3] - rois[:, 1]
valid_index = np.where((h>min_size)&(w>min_size))[0]
valid_rois = rois[valid_index]
valid_scores = objectness_pred_scores[0][valid_index].data.numpy()
Boxes whose height or width is smaller than 16 are removed,
and the remainder are sorted by objectness score.
valid_score_order = valid_scores.ravel().argsort()[::-1]
pre_train_valid_score_order = valid_score_order[:n_train_pre_nms]
pre_train_valid_rois = valid_rois[pre_train_valid_score_order]
pre_train_valid_scores = valid_scores[pre_train_valid_score_order]
print(pre_train_valid_rois.shape) # 12000, 4
print(pre_train_valid_scores.shape) # 12000,
print(pre_train_valid_score_order.shape) # 12000,
We take the top 12,000 before applying NMS,
keep_index = nms(rois=pre_train_valid_rois, scores=pre_train_valid_scores, nms_thresh=nms_thresh)
post_train_valid_rois = pre_train_valid_rois[keep_index][:n_train_post_nms]
post_train_valid_scores = pre_train_valid_scores[keep_index][:n_train_post_nms]
print(post_train_valid_rois.shape) # 2000, 4
print(post_train_valid_scores.shape) # 2000,
then apply NMS and keep only 2,000 RoIs.
Even 2,000 is more than you might expect.
# anchor box labeling for Fast R-CNN
n_sample = 128
pos_ratio = 0.25
pos_iou_thresh = 0.5
neg_iou_thresh_hi = 0.5
neg_iou_thresh_lo = 0.0
From here on, the procedure mirrors how ground truth was built for the RPN.
The only difference is that this ground truth is for Fast R-CNN (actual class labels and actual bounding boxes).
ious = bbox_iou(post_train_valid_rois, bbox.numpy())
print(ious.shape) # 2000, 2
We compute the IoU between the 2,000 RoIs obtained above and the ground-truth boxes.
For the RPN this matrix was (8940, 2); here only 2,000 RoIs are compared, so it is (2000, 2).
bbox_assignments = ious.argmax(axis=1)
roi_max_ious = ious.max(axis=1)
roi_target_labels = labels[bbox_assignments]
print(roi_target_labels.shape) # 2000
Each RoI is assigned the actual label (6 or 8) of whichever box has the larger IoU:
if column 0 is larger the label is 6, if column 1 is larger it is 8 (check the box labels at the top of the walkthrough if this is confusing).
This produces an array like 6 8 6 6 8 6 6 6,
but of course not all of these can be targets.
total_n_pos = len(np.where(roi_max_ious >= pos_iou_thresh)[0])
n_pos_sample = int(n_sample*pos_ratio) if total_n_pos > n_sample*pos_ratio else total_n_pos
n_neg_sample = n_sample - n_pos_sample
print(n_pos_sample) # 10
print(n_neg_sample) # 118
So, based on the positive IoU threshold, only 128 RoIs (n_sample) are sampled across positives and negatives.
pos_index = np.where(roi_max_ious >= pos_iou_thresh)[0]
pos_index = np.random.choice(pos_index, size=n_pos_sample, replace=False)
neg_index = np.where((roi_max_ious < neg_iou_thresh_hi) & (roi_max_ious > neg_iou_thresh_lo))[0]
neg_index = np.random.choice(neg_index, size=n_neg_sample, replace=False)
print(pos_index.shape) # 10
print(neg_index.shape) # 118
We gather the positive and negative indices (since pos_ratio is 0.25, at most 32 positive boxes are kept),
keep_index = np.append(pos_index, neg_index)
post_sample_target_labels = roi_target_labels[keep_index].data.numpy()
post_sample_target_labels[len(pos_index):] = 0
post_sample_rois = post_train_valid_rois[keep_index]
and finally only the sampled RoIs are kept.
Plotting just the positives gives the figure above:
the lower-left box is label 6 and the upper-right box is label 8,
so the green boxes are RoIs for label 6 and the red ones are RoIs for label 8.
post_sample_bbox = bbox[bbox_assignments[keep_index]]
post_sample_format_rois = format_loc(anchors=post_sample_rois, base_anchors=post_sample_bbox.data.numpy())
print(post_sample_format_rois.shape)
Converting these into the loc format used to compare against Fast R-CNN's predictions finishes the target boxes.
# Fast R-CNN
rois = torch.from_numpy(post_sample_rois).float()
print(rois.shape) # 128, 4
# roi_indices = torch.zeros((len(rois),1), dtype=torch.float32)
# print(rois.shape, roi_indices.shape)
# indices_and_rois = torch.cat([roi_indices, rois], dim=1)
# print(indices_and_rois.shape)
The RoIs are converted to torch tensors.
The commented-out code assigns a batch index to each RoI and concatenates [index, roi]
so that the computation can be done per batch (the batch size is 1 here).
RoI Pooling
size = (7, 7)
adaptive_max_pool = nn.AdaptiveMaxPool2d(size)
# correspond to feature map
rois.mul_(1/16.0)
rois = rois.long()
RoI pooling extracts a fixed-size output for each RoI.
The 128 RoIs are first mapped onto the 50x50 feature-map coordinates (divided by 16).
output = []
num_rois = len(rois)
for roi in rois:
roi_feature = output_map[..., roi[0]:roi[2]+1, roi[1]:roi[3]+1]
output.append(adaptive_max_pool(roi_feature))
output = torch.cat(output, 0)
print(output.shape) # 128, 512, 7, 7
Passing each RoI through the pooling layer and extracting a fixed-size output gives a (128, 512, 7, 7) result.
The RoI pooling layer is what makes the network independent of the input image size.
output_ROI_pooling = output.view(output.size(0), -1)
print(output_ROI_pooling.shape) # 128, 25088
Flattening this gives a (128, 25088) array.
RoI Head & Classifier, BBox Regression
roi_head = nn.Sequential(nn.Linear(25088, 4096),
nn.Linear(4096, 4096))
cls_loc = nn.Linear(4096, 21*4)
cls_loc.weight.data.normal_(0, 0.01)
cls_loc.bias.data.zero_()
cls_score = nn.Linear(4096, 21)
cls_score.weight.data.normal_(0, 0.01)
cls_score.bias.data.zero_()
x = roi_head(output_ROI_pooling)
roi_cls_loc = cls_loc(x)
roi_cls_score = cls_score(x)
print(roi_cls_loc.shape, roi_cls_score.shape) # 128, 84 / 128, 21
Finally, fully connected layers classify each RoI into 20 classes + 1 background,
and the location head outputs 4 values per class (21 * 4 = 84).
Fast R-CNN Loss
print(roi_cls_loc.shape) # 128, 84
print(roi_cls_score.shape) # 128, 21
These are the predictions.
print(post_sample_format_rois.shape) # 128, 4
print(post_sample_target_labels.shape) # 128,
gt_roi_cls_loc = torch.from_numpy(post_sample_format_rois).float()
gt_roi_cls_label = torch.from_numpy(post_sample_target_labels).long()
And these are the ground-truth values.
roi_cls_loss = F.cross_entropy(roi_cls_score, gt_roi_cls_label)
print(roi_cls_loss)
The classification loss is cross-entropy,
num_roi = roi_cls_loc.size(0)
roi_cls_loc = roi_cls_loc.view(-1, 21, 4)
roi_cls_loc = roi_cls_loc[torch.arange(num_roi), gt_roi_cls_label]
print(roi_cls_loc.shape)
mask = gt_roi_cls_label>0
mask_loc_pred = roi_cls_loc[mask]
mask_loc_target = gt_roi_cls_loc[mask]
print(mask_loc_pred.shape) # 10, 4
print(mask_loc_target.shape) # 10, 4
x = torch.abs(mask_loc_pred-mask_loc_target)
roi_loc_loss = ((x<0.5).float()*x**2*0.5 + (x>0.5).float()*(x-0.5)).sum()
print(roi_loc_loss)
and, just as in the RPN, Fast R-CNN masks out background RoIs so that bounding box regression is computed only where the label is not background.
roi_lambda = 10
N_reg = (gt_roi_cls_label>0).float().sum()
roi_loss = roi_cls_loss + roi_lambda / N_reg * roi_loc_loss
print(roi_loss)
Applying lambda gives the total loss for Fast R-CNN.
# Faster R-CNN Total Loss
total_loss = rpn_loss + roi_loss
The total loss of Faster R-CNN is the sum of rpn_loss and roi_loss.
Back-propagating this loss updates the network.
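As a minimal sketch of that update step, using the variables defined in the walkthrough above (the optimizer choice is just illustrative and not part of the original code):
params = (list(req_features.parameters()) + list(conv1.parameters())
          + list(reg_layer.parameters()) + list(cls_layer.parameters())
          + list(roi_head.parameters()) + list(cls_loc.parameters()) + list(cls_score.parameters()))
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)  # illustrative optimizer
optimizer.zero_grad()
total_loss.backward()
optimizer.step()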
Turning this whole procedure into modules was the painful part.
# PyTorch Modularization
The code below isn't cleaned up and is fairly long. (It still has some bugs and isn't generalized to different image sizes or other situations.)
faster_rcnn.py
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torchvision.ops import RoIPool
import numpy as np
from utils import *
# Backbone
from backbone import get_bb_clf
# bbox = torch.FloatTensor([[30,20,500,400], [400,300,600,500]])
# labels = torch.LongTensor([6, 8])
from creator_tools import *
# RPN
class RPN(nn.Module):
def __init__(
self, in_c=512, mid_c=512,
image_size=(800,800), sub_sample=16,
anchor_scale=[8,16,32], ratio=[0.5,1,2],
):
super(RPN, self).__init__()
self.rpn = nn.Conv2d(in_c, mid_c, 3, 1, 1)
self.relu = nn.ReLU(inplace=True)
n_anchor = len(anchor_scale) * len(ratio)
self.reg = nn.Conv2d(mid_c, n_anchor*4, 1, 1, 0)
self.cls = nn.Conv2d(mid_c, n_anchor*2, 1, 1, 0)
self.anchor_base = generate_anchors(image_size, sub_sample=sub_sample,
anchor_scale=anchor_scale, ratio=ratio)
self.proposal_layer = ProposalCreator(self)
weight_init(self.rpn)
weight_init(self.reg)
weight_init(self.cls)
# x : feature map
def forward(self, x, img_size, scale=1.):
n, _, h, w = x.shape
anchor = self.anchor_base
n_anchor = anchor.shape[0] // (h * w) # 9
x = self.rpn(x)
x = self.relu(x)
pred_loc = self.reg(x) # batch, anchor*4, height, width
pred_cls = self.cls(x) # batch, anchor*2, height, width
pred_loc = pred_loc.permute(0, 2, 3, 1).contiguous().view(n, -1, 4) # batch anchors (coor)
pred_cls = pred_cls.permute(0, 2, 3, 1).contiguous() # batch anchors (obj)
pred_sfmax_cls = F.softmax(pred_cls.view(n, h, w, n_anchor, 2), dim=4)
pred_fg_cls = pred_sfmax_cls[:,:,:,:,1].contiguous()
pred_fg_cls = pred_fg_cls.view(n, -1)
pred_cls = pred_cls.view(n, -1, 2)
pred_object = pred_cls[:, :, 1]
rois = []
roi_indices = []
for i in range(n):
roi = self.proposal_layer(
pred_loc[i].cpu().data.numpy(),
pred_fg_cls[i].cpu().data.numpy(),
anchor, img_size,scale=scale)
batch_index = i * np.ones((len(roi),), dtype=np.int32)
rois.append(roi)
roi_indices.append(batch_index)
rois = np.concatenate(rois, axis=0)
roi_indices = np.concatenate(roi_indices, axis=0)
return pred_loc, pred_cls, rois, roi_indices, anchor
# target_loc, target_cls come from assign_cls_loc
def rpn_loss(pred_loc, pred_cls, target_loc, target_cls, rpn_lamda=10):
# cls loss
# print(pred_cls.shape)
gt_rpn_cls = torch.from_numpy(target_cls).long().to('cuda:0')
pred_rpn_cls = pred_cls[0].to('cuda:0')
# print(pred_rpn_cls.shape, gt_rpn_cls.shape)
rpn_cls_loss = F.cross_entropy(pred_rpn_cls, gt_rpn_cls, ignore_index=-1)
# reg loss
gt_rpn_loc = torch.from_numpy(target_loc).to('cuda:0')
pred_rpn_loc = pred_loc[0].to('cuda:0')
mask = gt_rpn_cls > 0
mask_gt_loc = gt_rpn_loc[mask]
mask_pred_loc = pred_rpn_loc[mask]
x = torch.abs(mask_gt_loc - mask_pred_loc)
rpn_loc_loss = ((x<0.5).float()*(x**2)*0.5 + (x>0.5).float()*(x-0.5)).sum()
N_reg = mask.float().sum()
rpn_loss = rpn_cls_loss + rpn_lamda / N_reg * rpn_loc_loss
return rpn_cls_loss, rpn_loc_loss, rpn_loss
# class RoIHead(nn.Module):
# def __init__(self, n_class, roi_size, spatial_scale, classifier):
# super(RoIHead, self).__init__()
# self.classifier = classifier
# self.cls_loc
class FastRCNN(nn.Module):
def __init__(self, classifier, n_class=21, size=(7,7), spatial_scale=(1./16)):
super(FastRCNN, self).__init__()
self.roi = RoIPool(size, spatial_scale)
self.roi_pool = nn.AdaptiveMaxPool2d(size)
self.classifier = classifier
self.reg = nn.Linear(4096, n_class*4)
weight_init(self.reg)
self.cls = nn.Linear(4096, n_class)
weight_init(self.cls)
def forward(self, feature_map, rois, roi_indices):
# correspond to feature map
roi_indices = totensor(roi_indices).float()
rois = totensor(rois).float()
indices_rois = t.cat([roi_indices[:, None], rois], dim=1).contiguous()
pool = self.roi(feature_map, indices_rois)
pool = pool.view(pool.size(0), -1)
x = self.classifier(pool)
roi_loc = self.reg(x)
roi_cls = self.cls(x)
return roi_loc, roi_cls
# gt_loc = torch.from_numpy(final_rois).float()
# gt_cls = torch.from_numpy(final_cls).long()
def fastrcnn_loss(roi_loc, roi_cls, gt_loc, gt_cls): # [128, 84], [128, 21], [128, 4] torch float, [128, 1] torch long
roi_cls = roi_cls.to('cuda:0')
gt_cls = gt_cls.to('cuda:0')
roi_loc = roi_loc.to('cuda:0')
gt_loc = torch.from_numpy(gt_loc).float().to('cuda:0')
# print(roi_cls)
# print(roi_cls.shape, gt_cls.shape, roi_loc.shape, gt_loc.shape)
cls_loss = F.cross_entropy(roi_cls, gt_cls)
# print(cls_loss)
num_roi = roi_loc.size(0)
roi_loc = roi_loc.view(-1, 21, 4)
roi_loc = roi_loc[torch.arange(num_roi), gt_cls]
mask = gt_cls>0
mask_loc_pred = roi_loc[mask]
mask_loc_target = gt_loc[mask]
x = torch.abs(mask_loc_pred-mask_loc_target)
loc_loss = ((x<0.5).float()*x**2*0.5 + (x>0.5).float()*(x-0.5)).sum()
# print(loc_loss)
roi_lamda = 10
N_reg = (gt_cls>0).float().sum()
roi_loss = cls_loss + roi_lamda / N_reg * loc_loss
return cls_loss, loc_loss, roi_loss
class FasterRCNN(nn.Module):
def __init__(self, backbone, rpn, head):
super(FasterRCNN, self).__init__()
self.backbone = backbone
self.rpn = rpn
self.head = head # Fast R-CNN
self.proposal_target_creator = ProposalTargetCreator()
def forward(self, img, bboxes, labels):
b, c, h, w = img.shape
##### backbone
feature_map = self.backbone(img)
##### RPN
# anchors = generate_anchors((w, h))
# target_cls, target_loc = assign_cls_loc(bboxes, anchors, (w, h))
pred_loc, pred_cls, rois, roi_indices, anchor = self.rpn(feature_map, (w,h), scale=1.)
target_cls, target_loc = assign_cls_loc(bboxes, anchor, (w,h))
rpn_cls_loss, rpn_loc_loss, t_rpn_loss = rpn_loss(pred_loc, pred_cls, target_loc, target_cls)
# pred_loc, pred_cls, pred_object =
sample_roi, gt_roi_loc, gt_roi_label = self.proposal_target_creator(rois, bboxes, labels)
sample_roi_index = t.zeros(len(sample_roi))
##### HEAD - Fast RCNN
final_loc, final_cls = self.head(feature_map, sample_roi, sample_roi_index)
roi_cls_loss, roi_loc_loss, t_roi_loss = fastrcnn_loss(final_loc, final_cls, gt_roi_loc, gt_roi_label)
t_loss = torch.sum(t_roi_loss + t_rpn_loss)
return rpn_loc_loss, rpn_cls_loss, roi_cls_loss, roi_loc_loss, t_loss
# post_train_rois, post_train_scores = generate_proposal(anchors, pred_loc, pred_cls, pred_object, (w, h))
# final_rois, final_cls = assign_targets(post_train_rois, post_train_scores, bboxes, labels)
# final_rois, final_cls = torch.from_numpy(final_rois).float(), torch.from_numpy(final_cls).long()
# rois = torch.from_numpy(final_rois).float()
# roi_loc, roi_cls = self.fastrcnn(feature_map, final_rois)
# gt_loc = final_rois
# gt_cls = final_cls
# return final_loc, final_cls, rois, roi_indices
def fasterrcnn_loss(rpn_loss, roi_loss):
return torch.sum(rpn_loss + roi_loss)
class FasterRCNNSEMob(FasterRCNN):
down_size = 16
def __init__(self, n_fg_class=20, ratios=[0.5, 1, 2], anchor_scales=[8,16,32]):
backbone, classifier = get_bb_clf()
rpn = RPN()
head = FastRCNN(classifier, n_class=n_fg_class+1, spatial_scale=(1./16))
super(FasterRCNNSEMob, self).__init__(backbone, rpn, head)
def assign_cls_loc(bboxes, anchors, image_size, pos_thres=0.7, neg_thres=0.3, n_sample=256, pos_ratio=0.5):
valid_idx = np.where((anchors[:, 0] >= 0)
&(anchors[:, 1] >= 0)
&(anchors[:, 2] <= image_size[0])
&(anchors[:, 3] <= image_size[1]))[0]
# print(valid_idx.shape)
valid_cls = np.empty((valid_idx.shape[0], ), dtype=np.int32)
valid_cls.fill(-1)
valid_anchors = anchors[valid_idx]
ious = bbox_iou(valid_anchors, bboxes.numpy())
# print(ious.shape) # 8940, 2
# positives in valid_cls are produced by two scenarios
# a
iou_by_anchor = np.amax(ious, axis=1) # max IoU per anchor
pos_idx = np.where(iou_by_anchor >= pos_thres)[0]
neg_idx = np.where(iou_by_anchor < neg_thres)[0]
valid_cls[pos_idx] = 1
valid_cls[neg_idx] = 0
# b
iou_by_gt = np.amax(ious, axis=0) # max IoU per ground-truth box
gt_idx = np.where(ious == iou_by_gt)[0]
# print(gt_idx)
valid_cls[gt_idx] = 1
total_n_pos = len(np.where(valid_cls == 1)[0])
n_pos = int(n_sample*pos_ratio) if total_n_pos > n_sample*pos_ratio else total_n_pos
n_neg = n_sample - n_pos
# keep at most 256 of the valid labels; disable the rest
pos_index = np.where(valid_cls == 1)[0]
# print(pos_index, len(pos_index, n_pos))
if len(pos_index) > n_sample*pos_ratio:
disable_index = np.random.choice(pos_index, size=len(pos_index)-n_pos, replace=False)
valid_cls[disable_index] = -1
neg_index = np.where(valid_cls == 0)[0]
disable_index = np.random.choice(neg_index, size=len(neg_index) - n_neg, replace=False)
valid_cls[disable_index] = -1
# final valid class labels (object or not)
# print(len(np.where(valid_cls==1)[0]), len(np.where(valid_cls==0)[0]))
# valid loc
# assign each anchor the loc of whichever ground-truth box has the higher IoU
argmax_iou = np.argmax(ious, axis=1)
max_iou_box = bboxes[argmax_iou].numpy() # must have the same shape as valid_anchors
valid_loc = format_loc(valid_anchors, max_iou_box)
# print(valid_loc.shape) # 8940, 4 dx dy dw dh
# scatter the valid labels (pos/neg) computed so far back into the full anchor set at the valid indices
target_cls = np.empty((len(anchors),), dtype=np.int32)
target_cls.fill(-1)
target_cls[valid_idx] = valid_cls
# scatter the computed dx, dy, dw, dh back into the full anchor set at the valid indices
target_loc = np.zeros((len(anchors), 4), dtype=np.float32)
target_loc[valid_idx] = valid_loc
# print(target_cls.shape)
# print(target_loc.shape)
return target_cls, target_loc
# for Fast RCNN
def generate_proposal(anchors, pred_loc, pred_cls, pred_object, image_size,
n_train_pre_nms=12000,
n_train_post_nms=2000,
n_test_pre_nms=6000,
n_test_post_nms=300,
min_size=16, nms_thresh=0.7):
rois = deformat_loc(anchors=anchors, formatted_base_anchor=pred_loc[0].cpu().data.numpy())
np.where(rois[:,0])
rois[:, [0,2]] = np.clip(rois[:, [0,2]], a_min=0, a_max=image_size[0]) # x [0 ~ 800] width
rois[:, [1,3]] = np.clip(rois[:, [1,3]], a_min=0, a_max=image_size[1]) # y [0 ~ 800] height
w = rois[:, 2] - rois[:, 0]
h = rois[:, 3] - rois[:, 1]
valid_idx = np.where((h>min_size)&(w>min_size))[0]
valid_rois = rois[valid_idx]
valid_scores = pred_object[0][valid_idx].cpu().data.numpy()
order_idx = valid_scores.ravel().argsort()[::-1]
pre_train_idx = order_idx[:n_train_pre_nms]
pre_train_rois = valid_rois[pre_train_idx]
pre_train_scores = valid_scores[pre_train_idx]
keep_index = nms(rois=pre_train_rois, scores=pre_train_scores, nms_thresh=nms_thresh)
post_train_rois = pre_train_rois[keep_index][:n_train_post_nms]
post_train_scores = pre_train_scores[keep_index][:n_train_post_nms]
return post_train_rois, post_train_scores
def assign_targets(post_train_rois, post_train_scores, bboxes, labels,
n_sample = 128,
pos_ratio = 0.25,
pos_thresh = 0.5,
neg_thresh_hi = 0.5,
neg_thresh_lo = 0.0):
ious = bbox_iou(post_train_rois, bboxes.numpy())
# cls
bbox_idx = ious.argmax(axis=1)
box_max_ious = ious.max(axis=1)
final_cls = labels[bbox_idx] # (2000,) holds the object class values
total_n_pos = len(np.where(box_max_ious >= pos_thresh)[0])
n_pos = int(n_sample*pos_ratio) if total_n_pos > n_sample*pos_ratio else total_n_pos
n_neg = n_sample - n_pos
pos_index = np.where(box_max_ious >= pos_thresh)[0]
pos_index = np.random.choice(pos_index, size=n_pos, replace=False)
neg_index = np.where((box_max_ious < neg_thresh_hi) & (box_max_ious >= neg_thresh_lo))[0]
neg_index = np.random.choice(neg_index, size=n_neg, replace=False)
keep_index = np.append(pos_index, neg_index)
final_cls = final_cls[keep_index].data.numpy()
final_cls[len(pos_index):] = 0
final_rois = post_train_rois[keep_index]
post_sample_bbox = bboxes[bbox_idx[keep_index]]
d_rois = format_loc(anchors=final_rois, base_anchors=post_sample_bbox.data.numpy())
return final_rois, final_cls
def weight_init(l):
if type(l) in [nn.Conv2d]:
l.weight.data.normal_(0, 0.01)
l.bias.data.zero_()
if __name__ == "__main__":
pass
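Putting it together, the module is meant to be driven roughly like this (a hedged sketch: backbone.py, creator_tools.py and utils.py below must be importable, a GPU is assumed because the loss functions move tensors to 'cuda:0', and as noted above the code still has bugs, so treat this as the intended usage rather than a verified run):
import torch
from faster_rcnn import FasterRCNNSEMob

model = FasterRCNNSEMob().to('cuda:0')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # illustrative optimizer

image = torch.zeros((1, 3, 800, 800)).float().to('cuda:0')
bbox = torch.FloatTensor([[30, 20, 500, 400], [400, 300, 600, 500]])
labels = torch.LongTensor([6, 8])

# forward returns the individual losses plus the combined total loss
rpn_loc_loss, rpn_cls_loss, roi_cls_loss, roi_loc_loss, total_loss = model(image, bbox, labels)
optimizer.zero_grad()
total_loss.backward()
optimizer.step()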
backbone.py
import torch.nn as nn
class SEBlock(nn.Module):
def __init__(self, c, r=16):
super(SEBlock, self).__init__()
self.squeeze = nn.AdaptiveAvgPool2d(1)
self.excitation = nn.Sequential(
nn.Linear(c, c // r, bias=False),
nn.ReLU(inplace=True),
nn.Linear(c // r, c, bias=False),
nn.Sigmoid()
)
def forward(self, x):
b, c, _, _ = x.size()
se = self.squeeze(x).view(b, c)
se = self.excitation(se).view(b, c, 1, 1)
return x * se.expand_as(x)
def mobile_block(in_dim, out_dim, stride=1):
return nn.Sequential(
nn.Conv2d(in_channels=in_dim, out_channels=in_dim, kernel_size=3, stride=stride, padding=1, groups=in_dim),
nn.BatchNorm2d(in_dim),
nn.ReLU(inplace=True),
nn.Conv2d(in_channels=in_dim, out_channels=out_dim, kernel_size=1, stride=1, padding=0),
nn.BatchNorm2d(out_dim),
nn.ReLU(inplace=True),
SEBlock(c=out_dim, r=16),
)
class SEMobileNet(nn.Module):
def __init__(self, width_multi=1, resolution_multi=1, num_classes=1000):
super(SEMobileNet, self).__init__()
base_width = int(32 * width_multi)
self.conv = nn.Sequential(
nn.Conv2d(in_channels=3, out_channels=base_width, kernel_size=3, stride=2, padding=1),
nn.BatchNorm2d(base_width),
nn.ReLU(inplace=True),
mobile_block(base_width, base_width*2),
mobile_block(base_width*2, base_width*4, 2),
mobile_block(base_width*4, base_width*4),
mobile_block(base_width*4, base_width*8, 2),
mobile_block(base_width*8, base_width*8),
mobile_block(base_width*8, base_width*16, 2), # 800x800 -> 50x50
*[mobile_block(base_width*16, base_width*16) for _ in range(5)], # 512 channel
mobile_block(base_width*16, base_width*32, 2),
mobile_block(base_width*32, base_width*32),
nn.AvgPool2d(7),
)
self.classifier = nn.Linear(base_width*32, num_classes)
def forward(self, x):
x = self.conv(x)
x = x.view(x.size(0), -1)
x = self.classifier(x)
return x
def get_bb_clf():
model = SEMobileNet()
backbone = model.conv[:-3]
for p in backbone[:2].parameters():
p.requires_grad = False
return backbone, nn.Sequential(nn.Linear(512*7*7, 4096), nn.ReLU(inplace=True))
creator_tools.py (from the Simple Faster RCNN pytorch code)
from utils import *
import torch
class ProposalCreator:
"""Proposal regions are generated by calling this object.
The :meth:`__call__` of this object outputs object detection proposals by
applying estimated bounding box offsets
to a set of anchors.
This class takes parameters to control number of bounding boxes to
pass to NMS and keep after NMS.
If the paramters are negative, it uses all the bounding boxes supplied
or keep all the bounding boxes returned by NMS.
This class is used for Region Proposal Networks introduced in
Faster R-CNN [#]_.
.. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \
Faster R-CNN: Towards Real-Time Object Detection with \
Region Proposal Networks. NIPS 2015.
Args:
nms_thresh (float): Threshold value used when calling NMS.
n_train_pre_nms (int): Number of top scored bounding boxes
to keep before passing to NMS in train mode.
n_train_post_nms (int): Number of top scored bounding boxes
to keep after passing to NMS in train mode.
n_test_pre_nms (int): Number of top scored bounding boxes
to keep before passing to NMS in test mode.
n_test_post_nms (int): Number of top scored bounding boxes
to keep after passing to NMS in test mode.
force_cpu_nms (bool): If this is :obj:`True`,
always use NMS in CPU mode. If :obj:`False`,
the NMS mode is selected based on the type of inputs.
min_size (int): A paramter to determine the threshold on
discarding bounding boxes based on their sizes.
"""
def __init__(self,
parent_model,
nms_thresh=0.7,
n_train_pre_nms=12000,
n_train_post_nms=2000,
n_test_pre_nms=6000,
n_test_post_nms=300,
min_size=16
):
self.parent_model = parent_model
self.nms_thresh = nms_thresh
self.n_train_pre_nms = n_train_pre_nms
self.n_train_post_nms = n_train_post_nms
self.n_test_pre_nms = n_test_pre_nms
self.n_test_post_nms = n_test_post_nms
self.min_size = min_size
def __call__(self, loc, score,
anchor, img_size, scale=1.):
"""input should be ndarray
Propose RoIs.
Inputs :obj:`loc, score, anchor` refer to the same anchor when indexed
by the same index.
On notations, :math:`R` is the total number of anchors. This is equal
to product of the height and the width of an image and the number of
anchor bases per pixel.
Type of the output is same as the inputs.
Args:
loc (array): Predicted offsets and scaling to anchors.
Its shape is :math:`(R, 4)`.
score (array): Predicted foreground probability for anchors.
Its shape is :math:`(R,)`.
anchor (array): Coordinates of anchors. Its shape is
:math:`(R, 4)`.
img_size (tuple of ints): A tuple :obj:`height, width`,
which contains image size after scaling.
scale (float): The scaling factor used to scale an image after
reading it from a file.
Returns:
array:
An array of coordinates of proposal boxes.
Its shape is :math:`(S, 4)`. :math:`S` is less than
:obj:`self.n_test_post_nms` in test time and less than
:obj:`self.n_train_post_nms` in train time. :math:`S` depends on
the size of the predicted bounding boxes and the number of
bounding boxes discarded by NMS.
"""
# NOTE: when test, remember
# faster_rcnn.eval()
# to set self.traing = False
if self.parent_model.training:
n_pre_nms = self.n_train_pre_nms
n_post_nms = self.n_train_post_nms
else:
n_pre_nms = self.n_test_pre_nms
n_post_nms = self.n_test_post_nms
# Convert anchors into proposal via bbox transformations.
# roi = loc2bbox(anchor, loc)
roi = deformat_loc(anchor, loc)
# Clip predicted boxes to image.
roi[:, [0,2]] = np.clip(roi[:, [0,2]], a_min=0, a_max=img_size[0]) # x [0 ~ 800] width
roi[:, [1,3]] = np.clip(roi[:, [1,3]], a_min=0, a_max=img_size[1]) # y [0 ~ 800] height
w = roi[:, 2] - roi[:, 0]
h = roi[:, 3] - roi[:, 1]
# Remove predicted boxes with either height or width < threshold.
min_size = self.min_size * scale
keep = np.where((h >= min_size) & (w >= min_size))[0]
roi = roi[keep, :]
score = score[keep]
# Sort all (proposal, score) pairs by score from highest to lowest.
# Take top pre_nms_topN (e.g. 6000).
order = score.ravel().argsort()[::-1]
if n_pre_nms > 0:
order = order[:n_pre_nms]
roi = roi[order, :]
score = score[order]
# Apply nms (e.g. threshold = 0.7).
# Take after_nms_topN (e.g. 300).
# unNOTE: somthing is wrong here!
# TODO: remove cuda.to_gpu
keep = nms(
torch.from_numpy(roi).cuda(),
torch.from_numpy(score).cuda(),
self.nms_thresh)
if n_post_nms > 0:
keep = keep[:n_post_nms]
roi = roi[keep] #.cpu().numpy()
return roi
class ProposalTargetCreator(object):
"""Assign ground truth bounding boxes to given RoIs.
The :meth:`__call__` of this class generates training targets
for each object proposal.
This is used to train Faster RCNN [#]_.
.. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \
Faster R-CNN: Towards Real-Time Object Detection with \
Region Proposal Networks. NIPS 2015.
Args:
n_sample (int): The number of sampled regions.
pos_ratio (float): Fraction of regions that is labeled as a
foreground.
pos_iou_thresh (float): IoU threshold for a RoI to be considered as a
foreground.
neg_iou_thresh_hi (float): RoI is considered to be the background
if IoU is in
[:obj:`neg_iou_thresh_hi`, :obj:`neg_iou_thresh_hi`).
neg_iou_thresh_lo (float): See above.
"""
def __init__(self,
n_sample=128,
pos_ratio=0.25, pos_iou_thresh=0.5,
neg_iou_thresh_hi=0.5, neg_iou_thresh_lo=0.0
):
self.n_sample = n_sample
self.pos_ratio = pos_ratio
self.pos_iou_thresh = pos_iou_thresh
self.neg_iou_thresh_hi = neg_iou_thresh_hi
self.neg_iou_thresh_lo = neg_iou_thresh_lo # NOTE:default 0.1 in py-faster-rcnn
def __call__(self, roi, bbox, label,
loc_normalize_mean=(0., 0., 0., 0.),
loc_normalize_std=(0.1, 0.1, 0.2, 0.2)):
"""Assigns ground truth to sampled proposals.
This function samples total of :obj:`self.n_sample` RoIs
from the combination of :obj:`roi` and :obj:`bbox`.
The RoIs are assigned with the ground truth class labels as well as
bounding box offsets and scales to match the ground truth bounding
boxes. As many as :obj:`pos_ratio * self.n_sample` RoIs are
sampled as foregrounds.
Offsets and scales of bounding boxes are calculated using
:func:`model.utils.bbox_tools.bbox2loc`.
Also, types of input arrays and output arrays are same.
Here are notations.
* :math:`S` is the total number of sampled RoIs, which equals \
:obj:`self.n_sample`.
* :math:`L` is number of object classes possibly including the \
background.
Args:
roi (array): Region of Interests (RoIs) from which we sample.
Its shape is :math:`(R, 4)`
bbox (array): The coordinates of ground truth bounding boxes.
Its shape is :math:`(R', 4)`.
label (array): Ground truth bounding box labels. Its shape
is :math:`(R',)`. Its range is :math:`[0, L - 1]`, where
:math:`L` is the number of foreground classes.
loc_normalize_mean (tuple of four floats): Mean values to normalize
coordinates of bouding boxes.
loc_normalize_std (tupler of four floats): Standard deviation of
the coordinates of bounding boxes.
Returns:
(array, array, array):
* **sample_roi**: Regions of interests that are sampled. \
Its shape is :math:`(S, 4)`.
* **gt_roi_loc**: Offsets and scales to match \
the sampled RoIs to the ground truth bounding boxes. \
Its shape is :math:`(S, 4)`.
* **gt_roi_label**: Labels assigned to sampled RoIs. Its shape is \
:math:`(S,)`. Its range is :math:`[0, L]`. The label with \
value 0 is the background.
"""
n_bbox, _ = bbox.shape
roi = np.concatenate((roi, bbox), axis=0)
pos_roi_per_image = np.round(self.n_sample * self.pos_ratio)
iou = bbox_iou(roi, bbox)
gt_assignment = iou.argmax(axis=1)
max_iou = iou.max(axis=1)
# Offset range of classes from [0, n_fg_class - 1] to [1, n_fg_class].
# The label with value 0 is the background.
gt_roi_label = label[gt_assignment] + 1
# Select foreground RoIs as those with >= pos_iou_thresh IoU.
pos_index = np.where(max_iou >= self.pos_iou_thresh)[0]
pos_roi_per_this_image = int(min(pos_roi_per_image, pos_index.size))
if pos_index.size > 0:
pos_index = np.random.choice(
pos_index, size=pos_roi_per_this_image, replace=False)
# Select background RoIs as those within
# [neg_iou_thresh_lo, neg_iou_thresh_hi).
neg_index = np.where((max_iou < self.neg_iou_thresh_hi) &
(max_iou >= self.neg_iou_thresh_lo))[0]
neg_roi_per_this_image = self.n_sample - pos_roi_per_this_image
neg_roi_per_this_image = int(min(neg_roi_per_this_image,
neg_index.size))
if neg_index.size > 0:
neg_index = np.random.choice(
neg_index, size=neg_roi_per_this_image, replace=False)
# The indices that we're selecting (both positive and negative).
keep_index = np.append(pos_index, neg_index)
gt_roi_label = gt_roi_label[keep_index]
gt_roi_label[pos_roi_per_this_image:] = 0 # negative labels --> 0
sample_roi = roi[keep_index]
# Compute offsets and scales to match sampled RoIs to the GTs.
gt_roi_loc = format_loc(sample_roi, bbox[gt_assignment[keep_index]])
gt_roi_loc = ((gt_roi_loc - np.array(loc_normalize_mean, np.float32)
) / np.array(loc_normalize_std, np.float32))
return sample_roi, gt_roi_loc, gt_roi_label
utils.py (from the Faster RCNN from scratch code)
import numpy as np
"""
tools to convert specified type
"""
import torch as t
import numpy as np
def tonumpy(data):
if isinstance(data, np.ndarray):
return data
if isinstance(data, t.Tensor):
return data.detach().cpu().numpy()
def totensor(data, cuda=True):
if isinstance(data, np.ndarray):
tensor = t.from_numpy(data)
if isinstance(data, t.Tensor):
tensor = data.detach()
if cuda:
tensor = tensor.cuda()
return tensor
def scalar(data):
if isinstance(data, np.ndarray):
return data.reshape(1)[0]
if isinstance(data, t.Tensor):
return data.item()
def generate_anchors(image_size, sub_sample=16, anchor_scale=[8,16,32], ratio=[0.5,1,2]):
len_ratio = len(ratio)
anchor_base = np.zeros((len(anchor_scale)*len_ratio, 4)) # 9x4
for idx, scale in enumerate(anchor_scale):
w = scale / np.sqrt(ratio) * sub_sample
h = scale * np.sqrt(ratio) * sub_sample
x1, y1, x2, y2 = -w/2, -h/2, w/2, h/2
anchor_base[idx*len_ratio:(idx+1)*len_ratio] = np.c_[x1, y1, x2, y2]
feature_map_size = image_size[0] // sub_sample, image_size[1] // sub_sample
ctr_x = np.arange(sub_sample//2, image_size[0], sub_sample)
ctr_y = np.arange(sub_sample//2, image_size[1], sub_sample)
ctr = np.zeros((*feature_map_size, 2))
for idx, y in enumerate(ctr_y):
ctr[idx, :, 0] = ctr_x
ctr[idx, :, 1] = y
anchors = np.zeros((*feature_map_size, *anchor_base.shape))
for idx_x in range(feature_map_size[0]):
for idx_y in range(feature_map_size[1]):
anchors[idx_x, idx_y] = (ctr[idx_x, idx_y] + anchor_base.reshape(-1, 2, 2)).reshape(-1, 4)
return anchors.reshape(-1, 4)
# compute bbox IoU: (num_of_boxes1, 4) x (num_of_boxes2, 4)
# bboxes_1: anchors, bboxes_2: target boxes
# column order: x1 y1 x2 y2
def bbox_iou(bboxes_1, bboxes_2):
len_bboxes_1 = bboxes_1.shape[0]
len_bboxes_2 = bboxes_2.shape[0]
ious = np.zeros((len_bboxes_1, len_bboxes_2))
for idx, bbox_1 in enumerate(bboxes_1):
yy1_max = np.maximum(bbox_1[1], bboxes_2[:, 1])
xx1_max = np.maximum(bbox_1[0], bboxes_2[:, 0])
yy2_min = np.minimum(bbox_1[3], bboxes_2[:, 3])
xx2_min = np.minimum(bbox_1[2], bboxes_2[:, 2])
height = np.maximum(0.0, yy2_min - yy1_max)
width = np.maximum(0.0, xx2_min - xx1_max)
eps = np.finfo(np.float32).eps
inter = height * width
union = (bbox_1[3] - bbox_1[1]) * (bbox_1[2] - bbox_1[0]) + \
(bboxes_2[:, 3] - bboxes_2[:, 1]) * (bboxes_2[:, 2] - bboxes_2[:, 0]) - inter + eps
iou = inter / union
ious[idx] = iou
return ious # ious (num_of_boxes1, num_of_boxes2)
# (x1, y1, x2, y2) -> (x, y, w, h) -> (dx, dy, dw, dh)
'''
t_{x} = (x - x_{a})/w_{a}
t_{y} = (y - y_{a})/h_{a}
t_{w} = log(w/ w_a)
t_{h} = log(h/ h_a)
anchors are the anchors
base_anchors are the boxes
'''
def format_loc(anchors, base_anchors):
width = anchors[:, 2] - anchors[:, 0]
height = anchors[:, 3] - anchors[:, 1]
ctr_x = anchors[:, 0] + width*0.5
ctr_y = anchors[:, 1] + height*0.5
base_width = base_anchors[:, 2] - base_anchors[:, 0]
base_height = base_anchors[:, 3] - base_anchors[:, 1]
base_ctr_x = base_anchors[:, 0] + base_width*0.5
base_ctr_y = base_anchors[:, 1] + base_height*0.5
eps = np.finfo(np.float32).eps
height = np.maximum(eps, height)
width = np.maximum(eps, width)
dx = (base_ctr_x - ctr_x) / width
dy = (base_ctr_y - ctr_y) / height
dw = np.log(base_width / width)
dh = np.log(base_height / height)
anchor_loc_target = np.stack((dx, dy, dw, dh), axis=1)
return anchor_loc_target
# (dx, dy, dw, dh) -> (x, y, w, h) -> (x1, y1, x2, y2)
'''
anchors are the default anchors
formatted_base_anchors are the boxes with (dy, dx, dh, dw)
'''
def deformat_loc(anchors, formatted_base_anchor):
width = anchors[:, 2] - anchors[:, 0]
height = anchors[:, 3] - anchors[:, 1]
ctr_x = anchors[:, 0] + width*0.5
ctr_y = anchors[:, 1] + height*0.5
dx, dy, dw, dh = formatted_base_anchor.T
base_width = np.exp(dw) * width
base_height = np.exp(dh) * height
base_ctr_x = dx * width + ctr_x
base_ctr_y = dy * height + ctr_y
base_anchors = np.zeros_like(anchors)
base_anchors[:, 0] = base_ctr_x - base_width*0.5
base_anchors[:, 1] = base_ctr_y - base_height*0.5
base_anchors[:, 2] = base_ctr_x + base_width*0.5
base_anchors[:, 3] = base_ctr_y + base_height*0.5
return base_anchors
# non-maximum-suppression
def nms(rois, scores, nms_thresh):
# print(scores, scores.shape)
order = (-scores).argsort().cpu().data.numpy()#[::-1]
# x1, y1, x2, y2 = rois.T
rois = rois.cpu().data.numpy()
keep_index = []
# print(order.size)
while order.size > 0:
i = order[0]
keep_index.append(i)
ious = bbox_iou(rois[i][np.newaxis, :], rois[order[1:]])
inds = np.where(ious <= nms_thresh)[1]
order = order[inds + 1]
return np.asarray(keep_index)
After several days of writing, reading, and re-checking this code, my brain was fried.
Maybe my head is just too analog for AI...
Ref.
Faster R-CNN PyTorch from scratch (KrisHan999)
Faster R-CNN TensorFlow Github (ganghee-lee)
Simple Faster R-CNN PyTorch (chenyuntc)