# Head Detection using Hollywood Heads Dataset

Lab Assignment from [AI for Beginners Curriculum](https://github.com/microsoft/ai-for-beginners).

## Task

Counting the number of people in a video surveillance camera stream is an important task that lets us estimate the number of visitors in a shop, busy hours in a restaurant, etc. To solve this task, we need to be able to detect human heads from different angles. To train an object detection model to detect human heads, we can use the [Hollywood Heads Dataset](https://www.di.ens.fr/willow/research/headdetection/).

## The Dataset

The [Hollywood Heads Dataset](https://www.di.ens.fr/willow/research/headdetection/release/HollywoodHeads.zip) contains 369,846 human heads annotated in 224,740 movie frames from Hollywood movies. It is provided in [PASCAL VOC](https://host.robots.ox.ac.uk/pascal/VOC/) format, where each image comes with an XML description file that looks like this:

```xml
<annotation>
    <folder>HollywoodHeads</folder>
    <filename>mov_021_149390.jpeg</filename>
    <source>
        <database>HollywoodHeads 2015 Database</database>
        <annotation>HollywoodHeads 2015</annotation>
        <image>WILLOW</image>
    </source>
    <size>
        <width>608</width>
        <height>320</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>head</name>
        <bndbox>
            <xmin>201</xmin>
            <ymin>1</ymin>
            <xmax>480</xmax>
            <ymax>263</ymax>
        </bndbox>
        <difficult>0</difficult>
    </object>
    <object>
        <name>head</name>
        <bndbox>
            <xmin>3</xmin>
            <ymin>4</ymin>
            <xmax>241</xmax>
            <ymax>285</ymax>
        </bndbox>
        <difficult>0</difficult>
    </object>
</annotation>
```

In this dataset, there is only one class of objects, `head`, and for each head you get the coordinates of its bounding box. You can parse the XML using Python libraries, or use [this library](https://pypi.org/project/pascal-voc/) to work with the PASCAL VOC format directly.

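As an illustration of the first option, here is a minimal sketch that reads one annotation with Python's standard `xml.etree.ElementTree` module. The function name `parse_voc_annotation` and the `SAMPLE_XML` string are hypothetical and only mirror the structure of the example above; real files in the dataset may contain fields or edge cases not handled here.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the annotation shown above (illustrative only).
SAMPLE_XML = """
<annotation>
    <filename>mov_021_149390.jpeg</filename>
    <size><width>608</width><height>320</height><depth>3</depth></size>
    <object>
        <name>head</name>
        <bndbox><xmin>201</xmin><ymin>1</ymin><xmax>480</xmax><ymax>263</ymax></bndbox>
    </object>
    <object>
        <name>head</name>
        <bndbox><xmin>3</xmin><ymin>4</ymin><xmax>241</xmax><ymax>285</ymax></bndbox>
    </object>
</annotation>
"""


def parse_voc_annotation(xml_text):
    """Return (filename, (width, height), boxes) for one PASCAL VOC XML string.

    Each box is a (name, xmin, ymin, xmax, ymax) tuple in pixel coordinates.
    """
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    size = root.find("size")
    width = int(size.findtext("width"))
    height = int(size.findtext("height"))

    boxes = []
    # findall("object") only looks at direct children of the root element.
    for obj in root.findall("object"):
        bnd = obj.find("bndbox")
        if bnd is None:
            # Defensive assumption: skip objects that carry no bounding box.
            continue
        # Go through float() in case a file stores decimal coordinates.
        coords = [int(float(bnd.findtext(tag)))
                  for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.findtext("name"), *coords))
    return filename, (width, height), boxes


filename, size, boxes = parse_voc_annotation(SAMPLE_XML)
print(filename, size, len(boxes))
```

To process the whole dataset, the same function could be applied to the text of each annotation file in turn.
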
## Training Object Detection

You can train an object detection model in one of the following ways:

* Using [Azure Custom Vision](https://docs.microsoft.com/azure/cognitive-services/custom-vision-service/quickstarts/object-detection?tabs=visual-studio&WT.mc_id=academic-57639-dmitryso) and its Python API to programmatically train the model in the cloud. Custom Vision will not be able to use more than a few hundred images for training the model, so you may need to limit the dataset.
* Using the example from the [Keras tutorial](https://keras.io/examples/vision/retinanet/) to train a RetinaNet model.
* Using the [torchvision.models.detection.RetinaNet](https://pytorch.org/vision/stable/_modules/torchvision/models/detection/retinanet.html) built-in module in torchvision.

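Whichever option you choose, you will probably need a small data preparation step. The sketch below shows two such helpers under stated assumptions: `sample_subset` picks a reproducible random subset (useful when the dataset has to be limited to a few hundred images), and `to_normalized_region` converts a pixel bounding box to `(left, top, width, height)` values in the 0 to 1 range. Both function names, the subset size of 300, and the file names are hypothetical; normalized regions are assumed here because cloud services such as Custom Vision commonly describe boxes that way, so check the SDK documentation for the exact format it expects.

```python
import random


def sample_subset(filenames, k, seed=0):
    """Pick up to k file names at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    names = sorted(filenames)  # sort first so the result does not depend on input order
    return rng.sample(names, min(k, len(names)))


def to_normalized_region(box, image_size):
    """Convert (xmin, ymin, xmax, ymax) in pixels to (left, top, width, height) in [0, 1]."""
    xmin, ymin, xmax, ymax = box
    img_w, img_h = image_size
    # Clamp the box to the image so the normalized values stay within [0, 1].
    xmin, xmax = max(0, xmin), min(img_w, xmax)
    ymin, ymax = max(0, ymin), min(img_h, ymax)
    return (xmin / img_w, ymin / img_h,
            (xmax - xmin) / img_w, (ymax - ymin) / img_h)


# Hypothetical file names standing in for the real dataset listing.
names = [f"mov_{i:03d}.jpeg" for i in range(1000)]
subset = sample_subset(names, 300)

# Box and image size taken from the annotation example above.
region = to_normalized_region((201, 1, 480, 263), (608, 320))
print(len(subset), region)
```

Fixing the seed keeps the subset stable between runs, which makes experiments easier to compare.
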
## Takeaway

Object detection is a task that is frequently required in industry. While there are some services that can be used to perform object detection (such as [Azure Custom Vision](https://docs.microsoft.com/azure/cognitive-services/custom-vision-service/quickstarts/object-detection?tabs=visual-studio&WT.mc_id=academic-57639-dmitryso)), it is important to understand how object detection works and to be able to train your own models.