Objective Evaluation of Approaches of Skin Detection using ROC Analysis
Skin detection is an important indicator of
human presence and actions in many domains, including interaction, interfaces
and security. It is commonly performed in three steps: transforming the pixel
color to a non-RGB colorspace, dropping the illuminance component of skin color,
and classifying by modeling the skin color distribution. In this paper, we
evaluate the effect of these three steps on the skin
detection performance. The contribution of this study is a new comprehensive
methodology for testing colorspaces and color models that allows
the best choices to be made for skin detection. Combinations of nine colorspaces, the
presence or the absence of the illuminance component, and the two color modeling
approaches are compared for different settings (indoor or outdoor) and modeling
parameters (the histogram size). Performance is measured using a receiver operating
characteristic (ROC) curve on a large dataset of 845 images (consisting of more
than 18.6 million pixels) with manual ground truth. The results reveal that (1)
colorspace transformations can improve performance in certain instances, (2) the
absence of the illuminance component decreases performance, and (3) skin color
modeling has a greater impact than colorspace transformation. We found that the
best performance was obtained on indoor images by transforming the pixel color
to the HSI or SCT colorspaces, keeping the illuminance component, and modeling
the color with the histogram approach using a larger size distribution.
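The three steps described above (colorspace transformation, optional removal of the illuminance component, and color modeling) can be sketched as follows. This is a minimal illustration, not the implementation evaluated in the paper. It assumes YCbCr with full-range BT.601 weights as the example colorspace, a default of 32 bins per channel, and a smoothed count ratio as the histogram score; the function names `rgb_to_ycbcr`, `to_bin`, `train_histograms`, and `skin_score` are hypothetical.

```python
from collections import Counter


def rgb_to_ycbcr(r, g, b):
    """Convert 8-bit RGB to full-range YCbCr (BT.601 weights, as used by JPEG)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def to_bin(pixel, bins=32, keep_illuminance=True):
    """Steps 1 and 2: transform the pixel, optionally drop Y, then quantize."""
    y, cb, cr = rgb_to_ycbcr(*pixel)
    channels = (y, cb, cr) if keep_illuminance else (cb, cr)  # 3D vs. 2D
    return tuple(max(0, min(int(c * bins / 256), bins - 1)) for c in channels)


def train_histograms(samples, bins=32, keep_illuminance=True):
    """Step 3: count skin and non-skin pixels per color bin.

    `samples` is an iterable of ((r, g, b), is_skin) pairs.
    """
    skin, nonskin = Counter(), Counter()
    for pixel, is_skin in samples:
        target = skin if is_skin else nonskin
        target[to_bin(pixel, bins, keep_illuminance)] += 1
    return skin, nonskin


def skin_score(pixel, skin, nonskin, bins=32, keep_illuminance=True):
    """Score in (0, 1): fraction of training pixels in this bin that were skin.

    Add-one smoothing (an assumption of this sketch) gives unseen bins 0.5.
    """
    key = to_bin(pixel, bins, keep_illuminance)
    s, n = skin[key], nonskin[key]
    return (s + 1) / (s + n + 2)
```

Thresholding `skin_score` at different values yields the different operating points that an ROC curve summarizes. Passing `keep_illuminance=False` gives the 2D variant that discards the illuminance component.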
A dataset of 845 images with 18.6 million pixels was used to compute the performance. The dataset comprises 4.9 million skin pixels and 13.7 million non-skin pixels. The images with skin pixels were collected from the AR face dataset, the UOPB dataset, and the University of Chile dataset. Images containing no skin at all were collected at random from the University of Washington content-based image retrieval database. Below is a sample of the images from each dataset.
AR dataset (indoor images with different levels of lighting)
![]() |
![]() |
![]() |
| no extra | extra on right | extra on both sides |
UOPB dataset (indoor images under different light sources)
![]() |
![]() |
![]() |
![]() |
| incandescent light | daylight | horizon light | fluorescent light |
University of Washington (outdoor images of non-skin pixels)

University of Chile (various outdoor scenes of people from the web and digitized movie clips)

Ground Truth
The ground truth (GT) is defined at the pixel level. Each pixel receives one of three labels: skin (black), non-skin (white), or don't-care (gray). The don't-care label is assigned to pixels that are too ambiguous or too tedious to label as either skin or non-skin. Below is a sample of the ground truth.
![]() |
![]() |
![]() |
![]() |
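Reading such a ground-truth mask could look like the sketch below. It assumes an 8-bit grayscale mask stored as nested lists, with 0 for skin, 255 for non-skin, and any intermediate gray for don't-care; the exact gray value and the tolerance `tol` are assumptions, and the names `decode_label` and `labeled_pixels` are hypothetical.

```python
SKIN, NON_SKIN, DONT_CARE = "skin", "non-skin", "dont-care"


def decode_label(gray, tol=10):
    """Map an 8-bit mask value to a label (black=skin, white=non-skin)."""
    if gray <= tol:
        return SKIN
    if gray >= 255 - tol:
        return NON_SKIN
    return DONT_CARE  # any intermediate gray


def labeled_pixels(image, mask):
    """Yield ((r, g, b), is_skin) pairs, skipping don't-care pixels.

    `image` and `mask` are row-major nested lists of equal shape.
    """
    for image_row, mask_row in zip(image, mask):
        for pixel, gray in zip(image_row, mask_row):
            label = decode_label(gray)
            if label != DONT_CARE:
                yield pixel, label == SKIN
```

Skipping don't-care pixels keeps them out of both training and testing, so ambiguous regions neither help nor penalize a classifier.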
The data was divided into 10 train/test folds. For each fold, 90% of the data is used for training the classifiers and 10% is used for testing the performance. We evaluate the performance of the classifier by counting the number of true positives and false positives for several different threshold parameters of the classifier. From the performance at each threshold, we construct a Receiver Operating Characteristic (ROC) curve and compute the area under the curve as the performance. We do this for each fold and for each combination of classifier, colorspace, and presence or absence of the illuminance component. We compute the average testing AUC of the 10 folds for each combination. For the histogram approach, we select the bin size with the highest average training performance.
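The evaluation procedure can be sketched as follows. This is a hedged illustration only: it assumes the AUC is computed with the trapezoidal rule, that folds are formed by interleaving sample indices, and that every test fold contains both classes. The source does not say how the folds were drawn, and the callables `fit` and `score` are hypothetical stand-ins for any classifier.

```python
def roc_points(scores, labels):
    """Sweep a threshold over the scores; return (FPR, TPR) points."""
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for k interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def cross_validated_auc(samples, fit, score, k=10):
    """Average test AUC over k folds; `samples` holds (x, is_skin) pairs."""
    aucs = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = fit([samples[j] for j in train_idx])
        test = [samples[j] for j in test_idx]
        scores = [score(model, x) for x, _ in test]
        labels = [is_skin for _, is_skin in test]
        aucs.append(auc(roc_points(scores, labels)))
    return sum(aucs) / len(aucs)
```

Selecting the histogram bin size would amount to running the same loop on the training portion for each candidate size and keeping the size with the highest average AUC.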
Below is a table showing the area under the curve (AUC) for each combination. An AUC of 1.0 is perfect and 0.0 is the worst. Beside each AUC is its rank among the 36 combinations.
| Colorspace | Normal, 3D | Normal, 2D | Histogram, 3D | Histogram, 2D |
|---|---|---|---|---|
| CIELAB | 0.889582 (18) | 0.899549 (12) | 0.907608 (6) | 0.894391 (16) |
| CIEXYZ | 0.861596 (27) | 0.848372 (35) | 0.894707 (15) | 0.876083 (20) |
| HSI | 0.843751 (36) | 0.85416 (31) | 0.947461 (1) | 0.939184 (2) |
| NRGB | 0.874557 (22) | 0.878295 (19) | 0.89405 (17) | 0.897082 (14) |
| RGB | 0.862084 (26) | 0.875932 (21) | 0.89825 (13) | 0.904549 (10) |
| SCT | 0.91291 (4) | 0.905784 (9) | 0.931799 (3) | 0.912142 (5) |
| YCbCr | 0.862135 (24) | 0.851044 (32) | 0.906634 (7) | 0.855645 (28) |
| YIQ | 0.862089 (25) | 0.850979 (34) | 0.901814 (11) | 0.855302 (30) |
| YUV | 0.862159 (23) | 0.851041 (33) | 0.906622 (8) | 0.855598 (29) |
Below are the ROC curves of the best-performing combination in red (HSI 3D histogram) and the worst-performing combination in green (HSI 3D Normal).
