
Crowdpose: Efficient crowded scenes pose estimation and a new benchmark

by 구운밤이다 2021. 10. 17.

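I have recently been running experiments on the PoseTrack dataset, and while thinking about how to improve pose estimation on video data, I ended up reading Crowdpose: Efficient crowded scenes pose estimation and a new benchmark.

To predict poses well in images crowded with people, the paper applies a joint-candidate loss to the single-person pose estimator (SPPE) and uses a global association graph model to pick the correct joints from among the joint candidates.

First, a conventional single-person pose estimator has a limited receptive field and therefore does not perform well in crowded scenes. The paper uses joint candidates instead: the region $R_i$, the proposal for the $i$-th person, is fed into the SPPE to obtain the heatmap $P_i$. Within $R_i$ there are two types of joints: those that belong to the $i$-th person and those that do not. (These are called target joints and interference joints. Because an interference joint may be the target joint of another person, it is not suppressed directly but is output with a different intensity.)

For the joints of the $i$-th person, a target joint heatmap $T_i^k$ is built, and a heatmap $C_i^k$ is built for the interference joints as well. The final loss is then defined from both heatmaps, with the interference term weighted by $\mu$, which was set to 0.5 in the experiments (through cross-validation).

Next, a person-joint graph is used to extract the actual joints from these joint candidates. To build joint nodes, two candidates at $p_1$ and $p_2$ are grouped as predictions of the same joint if the distance between them is within the minimum of $u_1$ and $u_2$ (the Gaussian response sizes on that joint's heatmap) multiplied by the deviation. Each group becomes a single joint node, which is connected to the person nodes produced by the human detector, with edges weighted by the response scores of the candidate joints (probably the heatmap scores).

Solving this graph means maximizing the sum of the edge weights, which can be decomposed into maximizing the weights for each joint type separately.

Paper summary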

Joint-Candidate SPPE (single-person pose estimator)

Loss Design

For the $i$-th human proposal, we input its region $R_i$ into our SPPE network and get the output heatmap $P_i$. There are two types of joints in $R_i$: the joints that belong to the $i$-th person, and the joints that belong to other human instances (not the $i$-th person). We name them target joints and interference joints respectively.

Our goal is to enhance target joints response and suppress interference joints response. However, we don’t suppress them directly since interference joints for the current proposal can be regarded as target joints for other proposals. Thus, we can leverage interference joints to estimate human poses with other human proposals in a global manner. Therefore, to utilize those two kinds of joint candidates, we output them with different intensities.

Heatmap Loss

For the $k$-th joint of the $i$-th person, we denote the target joint heatmap as $T_i^k$, consisting of a 2D Gaussian $G(p_i^k \mid \sigma)$ centered at the target joint location $p_i^k$ with standard deviation $\sigma$.

For interference joints, we denote them as a set $\Omega_i^k$. The heatmap of interference joints is denoted as $C_i^k$, consisting of a Gaussian mixture distribution $\sum_{p \in \Omega_i^k} G(p \mid \sigma)$. Our proposed loss is defined as

$$\mathrm{Loss}_i = \frac{1}{K} \sum_{k=1}^{K} \mathrm{MSE}\left[P_i^k,\; T_i^k + \mu\, C_i^k\right], \tag{1}$$

where $P_i^k$ is the predicted heatmap for the $k$-th joint and $\mu$ is an attenuation factor in the range $[0, 1]$. As mentioned above, interference joints are useful for indicating the joints of other human instances, so $\mu$ is chosen by cross-validation with this global view in mind. The result is $\mu = 0.5$, which fits our intuition: interference joints should be attenuated but not over-suppressed. The conventional heatmap loss function can be regarded as the special case $\mu = 0$.
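The loss above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the loss is the mean over joints of the MSE between the predicted heatmap and $T + \mu C$ (consistent with $\mu = 0$ recovering the conventional heatmap loss), and it assumes unnormalised Gaussians with peak value 1. The function names `gaussian_heatmap` and `joint_candidate_loss` are hypothetical.

```python
import math

def gaussian_heatmap(height, width, centers, sigma):
    """Sum of unnormalised 2D Gaussians (peak 1.0), one per (x, y) centre."""
    heatmap = [[0.0] * width for _ in range(height)]
    for cx, cy in centers:
        for y in range(height):
            for x in range(width):
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                heatmap[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    return heatmap

def joint_candidate_loss(pred, targets, interferences, sigma=2.0, mu=0.5):
    """Mean over joints of MSE(P_k, T_k + mu * C_k).

    pred:          list of K predicted heatmaps (H x W nested lists)
    targets:       list of K target joint locations (x, y)
    interferences: list of K lists of interference joint locations
    """
    total = 0.0
    for p_k, target, others in zip(pred, targets, interferences):
        h, w = len(p_k), len(p_k[0])
        t_k = gaussian_heatmap(h, w, [target], sigma)   # target heatmap T
        c_k = gaussian_heatmap(h, w, others, sigma)     # interference heatmap C
        sq_err = 0.0
        for y in range(h):
            for x in range(w):
                label = t_k[y][x] + mu * c_k[y][x]      # attenuated interference
                sq_err += (p_k[y][x] - label) ** 2
        total += sq_err / (h * w)
    return total / len(pred)
```

With `mu=0.0` the label reduces to the target heatmap alone, so a prediction that ignores interference joints is no longer penalised.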

Limitation of a conventional SPPE: it depends on a high-quality human detection result. Its task is to locate and identify target joints according to the given human proposal. If the SPPE mistakes interference joints for target joints, the error is unrecoverable, because missing joints cannot be restored in a post-processing step such as pose-NMS.

Person-Joint Graph

Due to our joint-candidate mechanism and the redundant human proposals from the human detector, joint candidates far outnumber the actual joints. To reduce redundant joints, we build a person-joint graph and apply a maximum person-joint matching algorithm to construct the final human poses.

  • Fig. 4: In a crowded scene, human proposals overlap heavily and tend to predict the same actual joint. If we directly connect the highest responses to build the final poses, two human proposals will locate the same right knee and right leg.

Joint Node Building

Since highly overlapped human proposals tend to predict the same actual joint (as in Figure 4), we first group the candidates that represent the same actual joint into one joint node. Thanks to the high-quality joint prediction, candidate joints that indicate the same joint are always close to each other. Thus, we can group them using the following criterion: given two candidate joints located at $p_1^k$ and $p_2^k$ with control deviation $\delta^k$, we label them as the same group if

$$\|p_1^k - p_2^k\|_2 \le \min\{u_1^k, u_2^k\}\,\delta^k, \tag{2}$$

where $u_1^k$ and $u_2^k$ are the Gaussian response sizes of the two joints on the heatmaps, determined by the Gaussian response deviation. $\delta^k$ is the parameter controlling the deviation of the $k$-th joint, which we adopt directly from the MSCOCO keypoint dataset [15]. We use $\min\{u_1^k, u_2^k\}$ rather than a constant threshold to guarantee that $p_1^k$ and $p_2^k$ are grouped only if each falls into the other's control domain (with radii $u_1^k \delta^k$ and $u_2^k \delta^k$) simultaneously. One node represents a group of joints clustered together by the above criterion. By treating each joint group as one node, we obtain the joint node set $J = \{v_j^k : k \in \{1, \dots, K\},\ j \in \{1, \dots, N_k\}\}$, where $N_k$ is the number of joint nodes of body part $k$ and $v_j^k$ is the $j$-th node of body part $k$. The total number of joint nodes in $J$ is $\sum_{k=1}^{K} N_k$.
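As a rough sketch of the grouping step, the snippet below links two candidates of the same joint type when their distance is at most $\min\{u_1, u_2\} \cdot \delta$ and then merges linked candidates into groups. The transitive merge via union-find is an assumption of this sketch (the text does not say how chains of pairwise matches are handled), and `group_candidates` is a hypothetical name.

```python
import math

def group_candidates(candidates, delta):
    """Cluster candidates of ONE joint type into joint nodes.

    candidates: list of (x, y, u), where u is the Gaussian response size
    delta:      control deviation for this joint type
    Returns a sorted list of index groups, one group per joint node.
    """
    parent = list(range(len(candidates)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a in range(len(candidates)):
        xa, ya, ua = candidates[a]
        for b in range(a + 1, len(candidates)):
            xb, yb, ub = candidates[b]
            # Both points must lie inside each other's control domain.
            if math.hypot(xa - xb, ya - yb) <= min(ua, ub) * delta:
                parent[find(a)] = find(b)

    groups = {}
    for idx in range(len(candidates)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())
```

For example, with `delta=0.5` the candidates `(10, 10, 4)` and `(11, 10, 4)` are 1 pixel apart and fall under the threshold of 2, so they form one node, while a candidate at `(30, 30, 4)` stays on its own.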

Person Node Building

Person nodes represent the human proposals detected by the human detector. We denote the person node set as $H = \{h_i : i \in \{1, \dots, M\}\}$, where $h_i$ is the $i$-th person node and $M$ is the number of detected human proposals. Ideally, a qualified human proposal tightly bounds a human instance. However, in crowded scenes, this condition is not always satisfied. The human detector will produce many redundant proposals, including truncated and loose bounding boxes. We will eliminate these low-quality person nodes during global person-joint matching in Sec. 3.3.

Person-Joint Edge

After obtaining the nodes for both joints and persons, we connect them to construct our person-joint graph. For each person node $h_i$, the JC-SPPE predicts several candidate joints. If one of these candidates contributes to the joint node $v_j^k$, we build an edge $e_{i,j}^k$ between them. The weight of $e_{i,j}^k$ is the response score of that candidate joint, denoted $w_{i,j}^k$. In this way, we construct the edge set $E = \{e_{i,j}^k : \forall i, j, k\}$.

The person-joint graph can then be written as:

$$G = ((H, J), E). \tag{3}$$

Globally Optimizing Association

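The goal is to solve the person-joint graph above by maximizing the total edge weight. The objective function is

$$\max_{d} G = \max_{d} \sum_{i,j,k} w_{i,j}^k \cdot d_{i,j}^{(k)} \tag{4}$$

$$\text{s.t.} \quad \sum_{j} d_{i,j}^{(k)} \le 1, \quad \forall k \in \{1, \dots, K\},\ \forall i \in \{1, \dots, M\} \tag{5}$$

$$\sum_{i} d_{i,j}^{(k)} \le 1, \quad \forall k \in \{1, \dots, K\},\ \forall j \in \{1, \dots, N_k\} \tag{6}$$

$$d_{i,j}^{(k)} \in \{0, 1\}, \quad \forall i, j, k \tag{7}$$

where $d_{i,j}^{(k)}$ indicates whether we keep the edge $e_{i,j}^k$ in our final graph or not. The constraints of Eq. 5 and 6 enforce that each human proposal matches at most one joint node of the $k$-th type, and that each joint node is matched to at most one human proposal.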

Note that $G$ can be decomposed into $K$ sub-graphs $G_k = ((H, J^{(k)}), E^{(k)})$, where $G_k$ is the sub-graph that consists only of the $k$-th kind of joints. Thus, our objective function can be formulated as

$$\max_{d} G = \max_{d} \sum_{i,j,k} w_{i,j}^k \cdot d_{i,j}^{(k)} \tag{8}$$

$$= \sum_{k=1}^{K} \left( \max_{d^{(k)}} \sum_{i,j} w_{i,j}^k \cdot d_{i,j}^{(k)} \right) \tag{9}$$

$$= \sum_{k=1}^{K} \max_{d^{(k)}} G_k. \tag{10}$$

As shown in Eq. 10, solving the global assignment problem in the person-joint graph $G$ is mathematically equivalent to solving each sub-graph $G_k$ separately. $G_k$ is a bipartite graph composed of the person subset and the $k$-th joint subset. For each sub-graph, the updated Kuhn-Munkres algorithm [1] is applied to get the optimized result. By addressing each $G_k$ in turn, we obtain the final result set $R$.

Given the graph matching result, if $d_{i,j}^{(k)} = 1$, the weighted center of $v_j^k$ is assigned to the $i$-th human proposal as its $k$-th joint. Here, the weighted center is the linear combination of the candidate joint coordinates in $v_j^k$, weighted by their heatmap response scores. In this way, the pose of each human proposal can be constructed. Person nodes that cannot be matched to any joint are removed.
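To make the per-joint-type matching concrete, here is a small sketch for a single sub-graph. It is not the updated Kuhn-Munkres algorithm cited above: it swaps in an exhaustive search over assignments, which gives the same optimum but is only usable on tiny graphs. The names `match_subgraph` and `weighted_center` are hypothetical, a weight of 0.0 is assumed to mean "no edge", and the weighted center is assumed to be normalised by the sum of the scores.

```python
from itertools import permutations

def match_subgraph(weights):
    """Maximum-weight person-to-joint-node matching for one joint type.

    weights[i][j] is the edge weight between person i and joint node j
    (0.0 = no edge). Returns {person: joint_node} for the kept edges.
    """
    m = len(weights)
    n = len(weights[0]) if m else 0
    size = max(m, n)
    # Pad to a square matrix with zero-weight dummy rows and columns.
    padded = [[weights[i][j] if i < m and j < n else 0.0
               for j in range(size)] for i in range(size)]
    best_score, best_perm = -1.0, None
    for perm in permutations(range(size)):
        score = sum(padded[i][perm[i]] for i in range(size))
        if score > best_score:
            best_score, best_perm = score, perm
    # Keep only real edges with positive weight (d_ij = 1).
    return {i: best_perm[i] for i in range(m)
            if best_perm[i] < n and weights[i][best_perm[i]] > 0.0}

def weighted_center(candidates):
    """Score-weighted mean of the (x, y, score) candidates in a joint node."""
    total = sum(s for _, _, s in candidates)
    x = sum(px * s for px, _, s in candidates) / total
    y = sum(py * s for _, py, s in candidates) / total
    return x, y
```

With weights `[[0.9, 0.8], [0.85, 0.1]]`, greedily taking the strongest edge (0.9) first leaves person 1 with 0.1 for a total of 1.0, whereas the global optimum pairs person 0 with node 1 and person 1 with node 0 for a total of 1.65. For realistic graph sizes a Kuhn-Munkres solver would replace the brute-force loop; SciPy's `scipy.optimize.linear_sum_assignment` with `maximize=True` is one option.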

Computational Complexity

The inference speed of pose estimation is essential in many applications. We prove that our global association algorithm is as efficient as common greedy NMS algorithms. Following the hereditary property identified by White and Whiteley [2], a graph $G$ is $(k, l)$-sparse if every nonempty sub-graph $X$ has at most $k|X| - l$ edges, where $|X|$ is the number of vertices in sub-graph $X$ and $0 \le l < 2k$. Consider the sub-graph $G^{(k)} = ((H, J^{(k)}), E^{(k)})$, which represents the connections between human proposals and the $k$-th type of joints. According to our statistics (Fig. 5), every human bounding box covers at most four persons in crowded scenes. Therefore, one person node builds connection edges to at most 4 joints. In other words, our person-joint sub-graph $G^{(k)}$ is $(4, 0)$-sparse, since

$$|E^{(k)}| \le 4\,|H| \le 4\,\bigl(|H| + |J^{(k)}|\bigr) - 0. \tag{11}$$

Due to the sparsity of our person-joint graph, we can solve the association problem efficiently. We transform $E^{(k)}$ into an adjacency matrix $M_e^k$ (entries for unconnected node pairs are set to 0). According to the work of Carpaneto et al. [1], this linear assignment problem for a sparse matrix can be solved in $O(n^2)$, i.e., $O((|H| + |J^{(k)}|)^2)$. Since we have eliminated the redundant joints and there is a one-to-one correspondence between joints and persons, the expectation of $|J^{(k)}|$ is equal to $|H|$. Thus we have $O((|H| + |J^{(k)}|)^2) = O((2|H|)^2) = O(|H|^2)$. This computational complexity is the same as that of conventional greedy NMS algorithms.

  • Fig. 5: Instance-joint connection distribution. The x-axis denotes the number of human bounding boxes that cover the same joint. This statistical result is based on the ground truth annotations.

Discussion

Our method adopts a graph-based approach to associate joints with human proposals in a globally optimal manner. Human proposals compete with each other for joint nodes. In this way, unqualified human proposals without a dominant human instance fail to be assigned any joints, since their joint response scores are all relatively low. Therefore, many redundant and poor human proposals are rejected. In comparison, conventional NMS is a greedy, instance-based algorithm, which is less effective. Although [10, 17, 27] proposed pose-NMS to utilize pose information, their algorithms are based on instances and cannot tackle the problems of missing joints and wrong assembly. Our globally optimizing association method can deal with such situations well.

Dataset

Crowding Level Definition: we first need to define a Crowd Index that measures the crowding level of a given image. The number of persons in an image seems like a good measure. However, the principal obstacle in crowded cases is not the number of persons but the occlusion within a crowd. Therefore, we need a new Crowd Index to indicate the crowding level. In the bounding box of the $i$-th human instance, we denote the number of joints belonging to the $i$-th person and to other (not $i$-th) persons as $N_i^a$ and $N_i^b$ respectively. $N_i^b / N_i^a$ is the crowd ratio of the $i$-th human instance. Our Crowd Index is derived by averaging the crowd ratio of all persons in an image:

$$\text{Crowd Index} = \frac{1}{n} \sum_{i=1}^{n} \frac{N_i^b}{N_i^a}, \tag{12}$$

where $n$ indicates the total number of persons in the image. We evaluate the Crowd Index distribution of three public benchmarks: MSCOCO (person subset), MPII and AI Challenger. As shown in Figure 6, uncrowded scenes dominate these benchmarks, which leads SOTA methods to focus only on these simple cases and ignore the crowded ones.
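The Crowd Index is simple enough to compute by hand, and a short sketch may help. The function below averages the per-person crowd ratio as described above; `crowd_level` then buckets the value using the easy/medium/hard ranges listed under Experiments. Both names are hypothetical, every person is assumed to have at least one of their own joints inside their box, and the treatment of the boundary values 0.1 and 0.8 is an assumption, since the ranges in the text share their endpoints.

```python
def crowd_index(instances):
    """Mean of n_b / n_a over all persons in one image.

    instances: list of (n_a, n_b) pairs, where n_a counts the person's own
    joints inside their bounding box (assumed > 0) and n_b counts the
    joints of other persons inside the same box.
    """
    if not instances:
        return 0.0  # assumption: an image with no annotated person scores 0
    return sum(n_b / n_a for n_a, n_b in instances) / len(instances)

def crowd_level(index):
    """Bucket a Crowd Index into easy / medium / hard.

    Assumption: the boundary values 0.1 and 0.8 fall into the lower bucket.
    """
    if index <= 0.1:
        return "easy"
    if index <= 0.8:
        return "medium"
    return "hard"
```

For an image with two people, one whose box contains only their own 14 joints and one whose box contains 10 of their own joints plus 5 from someone else, the index is (0/14 + 5/10) / 2 = 0.25, which lands in the medium bucket.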

Experiments

  • Crowd Index levels: easy (0-0.1), medium (0.1-0.8) and hard (0.8-1)
  • Human detector: YOLOv3 [43]
  • Pose estimation network: the one provided by AlphaPose

For training, we adopt rotation (±30°), scaling (±30%) and flipping as data augmentation.

  • The input resolution is 320 × 256 and the output heatmap resolution is 80 × 64.
  • The learning rate is set to $1 \times 10^{-4}$ and lowered to $1 \times 10^{-5}$ after 80 epochs.
  • The mini-batch size is 64, and the RMSprop [7] optimizer is used.
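The training settings above could be collected in a small config object, sketched below. This is only an illustration: `TrainConfig`, `learning_rate` and `heatmap_stride` are hypothetical names, the resolutions are assumed to be given as (height, width), and the learning-rate switch is assumed to happen at 0-based epoch index 80.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    input_size: tuple = (320, 256)    # assumed (height, width) of the SPPE input
    heatmap_size: tuple = (80, 64)    # assumed (height, width) of the heatmap
    batch_size: int = 64
    optimizer: str = "RMSprop"
    base_lr: float = 1e-4
    decayed_lr: float = 1e-5
    decay_epoch: int = 80             # assumption: 0-based epoch index

def learning_rate(cfg, epoch):
    """Step schedule: base_lr before decay_epoch, decayed_lr from then on."""
    return cfg.base_lr if epoch < cfg.decay_epoch else cfg.decayed_lr

def heatmap_stride(cfg):
    """Downsampling factor between the input and the output heatmap."""
    stride_h = cfg.input_size[0] // cfg.heatmap_size[0]
    stride_w = cfg.input_size[1] // cfg.heatmap_size[1]
    if stride_h != stride_w:
        raise ValueError("input and heatmap sizes imply a non-uniform stride")
    return stride_h
```

With the listed resolutions the stride is 4 on both axes, so a joint at heatmap coordinate (x, y) maps back to roughly (4x, 4y) in the cropped input.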

For testing, the detected human bounding boxes are first extended by 30% along both the height and width directions and then forwarded through the Joint-candidate SPPE.

  • Hardware: two Nvidia 1080Ti GPUs
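The box extension at test time might look like the following sketch. The text does not say whether the 30% is split evenly between the two sides, so this version assumes the box grows symmetrically about its center until its width and height are each 30% larger; clipping to the image is an optional extra that is also an assumption. `extend_box` is a hypothetical helper.

```python
def extend_box(x1, y1, x2, y2, ratio=0.3, img_w=None, img_h=None):
    """Grow a box about its centre so width and height increase by `ratio`.

    If img_w / img_h are given, the result is clipped to the image bounds.
    """
    dw = (x2 - x1) * ratio / 2.0   # half of the extra width on each side
    dh = (y2 - y1) * ratio / 2.0   # half of the extra height on each side
    nx1, ny1, nx2, ny2 = x1 - dw, y1 - dh, x2 + dw, y2 + dh
    if img_w is not None:
        nx1, nx2 = max(0.0, nx1), min(float(img_w), nx2)
    if img_h is not None:
        ny1, ny2 = max(0.0, ny1), min(float(img_h), ny2)
    return nx1, ny1, nx2, ny2
```

A 100 × 200 box with corners (100, 100) and (200, 300) becomes a 130 × 260 box with corners (85, 70) and (215, 330).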

Ablation Studies

Replacing our joint-candidate loss with a plain mean square loss lowers the result from 66.0% to 61.7%.
