Accuracy of Released Image Classification Models #59

ahkarami · 2017-12-06T07:33:00Z

Dear @rkrasin,
Thank you for your fantastic Data Set. Would you please kindly inform us the accuracy (i.e., F1-Score) of your released models (i.e., ResNet 101 image classification model and Inception V3 model) on validation set?

rkrasin · 2017-12-06T07:36:50Z

Hi @ahkarami,

thank you for your kind words. While I played my part, the dataset is a work of many people, and many thanks should also go to the Google management that funded the work and gave the permission to release that much internal knowledge.

As for the accuracy of the most recently released models, I think @nalldrin has the most to say. After all, I left Google in May 2017, and no longer in the loop.

ahkarami · 2017-12-06T07:45:17Z

Dear @rkrasin,
Thank you for your time and response.

rkrasin · 2017-12-06T07:48:59Z

@ahkarami after some thought, the answer to your question can be obtained in a dyi fashion:

Download validation set images from Google Cloud Storage, as described in: https://github.com/cvdfoundation/open-images-dataset#download-images-with-bounding-boxes-annotations
Run the classifier on each of the image (optionally, with a few random crops)
Compute the score you want.

I will do it myself on the weekend, unless someone gets the score published before that.

rkrasin · 2017-12-11T03:32:19Z

So, as promised, I have started doing so, and realized that the images are not actually publicly released. The registration form at CVDF website has to be submitted before any pixels are released.

I plan to document the steps required to get the access for the benefit of everyone else. The alternative way is to use Google Cloud Transfer Service. That will take about a week, so I will first try the CVDF hosted option.

The first thing I did is to fill out the form as below:

Name: Ivan Krasin
Organization: Myself
Gmail / Gmail Associated Account: imkrasin@gmail.com
Usage of Dataset (e.g. training object detector for academia research): unrestricted use

rkrasin · 2017-12-11T03:33:41Z

Actually, I already have the email that says the request is approved:

Hello,

Thanks for signing up to download Open Images Dataset. This is a confirmation that you successfully register to download the dataset.Please follow the instruction on the Github page to download the dataset.

All images have a CC BY 2.0 license. Note that CVDF does not own the copyright of images and the image license is subject to change by the image owner.

If you have any question, please contact info@cvdfoundation.org

Best,
Common Visual Data Foundation

Nice!

rkrasin · 2017-12-11T04:32:17Z

And indeed, the access works:

$ gsutil ls gs://open-images-dataset/
gs://open-images-dataset/test/
gs://open-images-dataset/train/
gs://open-images-dataset/validation/

$ gsutil ls gs://open-images-dataset/test/*.jpg | wc -l
125436

Then I have copied the images on my Google Compute Engine instance. Not the most optimized chain of command-line commands, but here we are:

$ mkdir -p ~/out/openimages/{test,validation}
$ cd ~/out/openimages/test
$ gsutil -m rsync gs://open-images-dataset/test/ .
<wait for 15 minutes>
$ ls | wc -l
125436
$ cd ~/out/openimages/validation
$ gsutil -m rsync gs://open-images-dataset/validation/ .
<wait for 5 minutes>
$ ls | wc -l
41620

Note that test + validation sets are 50 GB in total.

The next step is to run both classifiers (v1 and v2) on all images and produce two CSV files for later analysis. I will probably skip all further details until the scores are obtained. The critical step, getting the images, confirmed to be working.

nalldrin · 2017-12-11T17:48:48Z

Hi, sorry for not noticing this thread sooner. A few points of clarification. First, we are providing the image classification checkpoints more as a courtesy than to serve as a competitive baseline model and we haven't provided any official guidance as to how to evaluate them. That said, internally we tracked the AP and mAP scores over the validation+test sets: .587 and .506 respectively for the v2 checkpoint. You should be able to approximately reproduce this by calculating it yourself over the test+val sets.

btaba · 2018-03-13T22:38:45Z

For what it's worth, I downloaded the v2 checkpoint:

#! /bin/bash

mkdir -p /tmp/open_image/checkpoint/
wget -nc -nv -O /tmp/open_image/checkpoint/oidv2-resnet_v1_101.ckpt.tar.gz\
    https://storage.googleapis.com/openimages/2017_07/oidv2-resnet_v1_101.ckpt.tar.gz

cd /tmp/open_image/checkpoint
tar -xvf oidv2-resnet_v1_101.ckpt.tar.gz

and simply ran these metrics on the predictions:

    from sklearn import metrics
    metric_dict['micro_map'] = metrics.average_precision_score(labels.ravel(), scores.ravel())
    average_precision = [metrics.average_precision_score(labels[:, i], scores[:, i]) for i in range(scores.shape[-1])]
    average_precision = [np.nan_to_num(a) for a in average_precision]
    metric_dict['macro_map'] = np.mean(average_precision)

and I got losses and predictions with something like this:

    sess = tf.Session()
    g = tf.get_default_graph()
    with g.as_default():
        saver = tf.train.import_meta_graph(checkpoint_path + '.meta')
        saver.restore(sess, checkpoint_path)

        input_values = g.get_tensor_by_name('input_values:0')
        predictions = g.get_tensor_by_name('multi_predictions:0')
        true_labels = tf.placeholder(
            dtype=tf.float32, shape=[None, 5000], name='labels')

        # using a TF Record generator here...
        data = data_generator(_SPLITS_TO_PATHS[split], repeat=False, parser=parser)

        stored_predictions = []
        stored_labels = []
        for idx, v in enumerate(data):
            images = v['image']
            labels = v['label']
            preds = sess.run(predictions, feed_dict={
                input_values: images,
                true_labels: labels
            })

            stored_predictions.append(preds)
            stored_labels.append(labels)

I get AP of 0.423 and 0.424 on the validation and test sets respectively, and mAP of 0.088 and 0.0892. Would be good if someone else can confirm or validate these results/code?

I was easily able to fine-tune the last layer of InceptionV3 and achieve mAP of 0.541 and 0.542 on validation/test and AP of 0.447 and 0.433.

ahkarami · 2018-03-14T17:51:46Z

Dear @btaba,
Thank you very much for your answer. I think your work is really valuable. However, I think your obtained results (especially mAP) is so low. As I remembered, I got F1-Score ~ 34.2% after only 15 epochs of training a customized ResNet50 model on the validation set (but with my own special training code on PyTorch, which I have used many tricks in it). By the way, your work is really valuable.
As Open Images is a Multi-Label Image Classification data set, so I recommend you to report your results via F1-Score. You can easily report it via sklearn library, something like below:

from sklearn.metrics import fbeta_score
beta = 1.0
fbeta_score(true_labels, predictions > threshold, beta, average='samples')

btaba · 2018-03-15T14:46:39Z

@ahkarami I also computed F1 for a threshold of 0.5, but I think AP and mAP may be more useful since they don't depend on the threshold. Nevertheless, I get 0.423/0.424 on validation/test for micro F1 and 0.088/0.089 for macro F1 for the resnet checkpoint. For the fine-tuned Inception v3 I get 0.515/0.515 and 0.372/0.377 for micro and macro F1 respectively.

ahkarami · 2018-03-16T19:18:44Z

Dear @btaba,
Thank you for your useful information. Just as a note, I remembered that threshold of 0.3 would be better for the ResNet checkpoint.
Thank you

mylyu · 2018-10-10T15:10:18Z

@ahkarami I also computed F1 for a threshold of 0.5, but I think AP and mAP may be more useful since they don't depend on the threshold. Nevertheless, I get 0.423/0.424 on validation/test for micro F1 and 0.088/0.089 for macro F1 for the resnet checkpoint. For the fine-tuned Inception v3 I get 0.515/0.515 and 0.372/0.377 for micro and macro F1 respectively.

Thanks for your test. Was "the fine-tuned Inception v3" trained on the validation dataset or train dataset? How many classes could it predict? I wonder why it could achieve much higher mAP than the released resnet101 model.
Or rather, the mAP of resnet101 model seems too low.

Perhaps because many rare classes were missing in the val/test set and nan_to_num gave many zeros, which reduced the mean value significantly? But this cannot explain why your finetuned inception v3 was alright.

mylyu · 2018-10-10T15:17:30Z

Hi, sorry for not noticing this thread sooner. A few points of clarification. First, we are providing the image classification checkpoints more as a courtesy than to serve as a competitive baseline model and we haven't provided any official guidance as to how to evaluate them. That said, internally we tracked the AP and mAP scores over the validation+test sets: .587 and .506 respectively for the v2 checkpoint. You should be able to approximately reproduce this by calculating it yourself over the test+val sets.

Could you comment on why @btaba got a much smaller mAP? Used different mAP definition? Thanks.

nalldrin · 2018-10-10T16:10:46Z

Just a note on the released checkpoint: it was trained primarily using the machine generated labels (I used human verifications if available, otherwise machine predictions). No fine-tuning on human verifications was performed. So it's easy to see how the mAP and other metrics can be massively improved by a little fine-tuning.

…

-Neil

On Wed, Oct 10, 2018, 8:17 AM Mengye LYU ***@***.***> wrote: Hi, sorry for not noticing this thread sooner. A few points of clarification. First, we are providing the image classification checkpoints more as a courtesy than to serve as a competitive baseline model and we haven't provided any official guidance as to how to evaluate them. That said, internally we tracked the AP and mAP scores over the validation+test sets: .587 and .506 respectively for the v2 checkpoint. You should be able to approximately reproduce this by calculating it yourself over the test+val sets. Could you comment on why @btaba <https://github.com/btaba> got a much smaller mAP? Used different mAP definition? Thanks. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#59 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/Aa9xhSpSW5INUAlBQ4fdEQYfumR8QKqQks5ujg-UgaJpZM4Q3fKg> .

btaba · 2018-10-11T15:40:52Z

@nalldrin makes sense.

@mylyu I trained on the train set exclusively. Predicted 5000 classes, but I believe the whole dataset was updated recently.

MyYaYa · 2019-01-08T02:54:37Z

Dear @btaba,
Thank you very much for your answer. I think your work is really valuable. However, I think your obtained results (especially mAP) is so low. As I remembered, I got F1-Score ~ 34.2% after only 15 epochs of training a customized ResNet50 model on the validation set (but with my own special training code on PyTorch, which I have used many tricks in it). By the way, your work is really valuable.
As Open Images is a Multi-Label Image Classification data set, so I recommend you to report your results via F1-Score. You can easily report it via sklearn library, something like below:
from sklearn.metrics import fbeta_score
beta = 1.0
fbeta_score(true_labels, predictions > threshold, beta, average='samples')

I'm also doing training on the openimage v4 dataset with res50 for multi-label classification, but I got terrible results whose mAP is very low.
I use resnet50_v2 and sigmoid cross entropy for loss.
Can you provide any useful tips for training or hyperparameters for training? Thanks!

ahkarami · 2019-01-08T18:40:47Z

Dear @MyYaYa,
I am really sorry. I can't share any more notes, because of commercial issues (because that was one of my tasks when I was at Sensifai Company).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accuracy of Released Image Classification Models #59

Accuracy of Released Image Classification Models #59

ahkarami commented Dec 6, 2017

rkrasin commented Dec 6, 2017 •

edited

Loading

ahkarami commented Dec 6, 2017

rkrasin commented Dec 6, 2017

rkrasin commented Dec 11, 2017 •

edited

Loading

rkrasin commented Dec 11, 2017 •

edited

Loading

rkrasin commented Dec 11, 2017

nalldrin commented Dec 11, 2017

btaba commented Mar 13, 2018 •

edited

Loading

ahkarami commented Mar 14, 2018 •

edited

Loading

btaba commented Mar 15, 2018

ahkarami commented Mar 16, 2018

mylyu commented Oct 10, 2018 •

edited

Loading

mylyu commented Oct 10, 2018

nalldrin commented Oct 10, 2018 via email

btaba commented Oct 11, 2018

MyYaYa commented Jan 8, 2019

ahkarami commented Jan 8, 2019

Accuracy of Released Image Classification Models #59

Accuracy of Released Image Classification Models #59

Comments

ahkarami commented Dec 6, 2017

rkrasin commented Dec 6, 2017 • edited Loading

ahkarami commented Dec 6, 2017

rkrasin commented Dec 6, 2017

rkrasin commented Dec 11, 2017 • edited Loading

rkrasin commented Dec 11, 2017 • edited Loading

rkrasin commented Dec 11, 2017

nalldrin commented Dec 11, 2017

btaba commented Mar 13, 2018 • edited Loading

ahkarami commented Mar 14, 2018 • edited Loading

btaba commented Mar 15, 2018

ahkarami commented Mar 16, 2018

mylyu commented Oct 10, 2018 • edited Loading

mylyu commented Oct 10, 2018

nalldrin commented Oct 10, 2018 via email

btaba commented Oct 11, 2018

MyYaYa commented Jan 8, 2019

ahkarami commented Jan 8, 2019

rkrasin commented Dec 6, 2017 •

edited

Loading

rkrasin commented Dec 11, 2017 •

edited

Loading

rkrasin commented Dec 11, 2017 •

edited

Loading

btaba commented Mar 13, 2018 •

edited

Loading

ahkarami commented Mar 14, 2018 •

edited

Loading

mylyu commented Oct 10, 2018 •

edited

Loading