<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" 
  xmlns:content="http://purl.org/rss/1.0/modules/content/" 
  xmlns:dc="http://purl.org/dc/elements/1.1/" 
  xmlns:atom="http://www.w3.org/2005/Atom" 
  xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" 
  xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Vision-and-Language on 行李の底に収めたり[YuWd]</title>
    <link>https://yuiga.dev/blog/en/tags/vision-and-language/</link>
    <description>Recent content in Vision-and-Language on 行李の底に収めたり[YuWd]</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <copyright>©2026, All Rights Reserved</copyright>
    <lastBuildDate>Thu, 24 Nov 2022 20:09:23 +0900</lastBuildDate>
    
        <atom:link href="https://yuiga.dev/blog/en/tags/vision-and-language/index.xml" rel="self" type="application/rss+xml" />
    

      
      <item>
        <title>How to create Matterport3D segmentation images?</title>
        <link>https://yuiga.dev/blog/en/ja/posts/matterport3d_semantic_segmentation/</link>
        <pubDate>Thu, 24 Nov 2022 20:09:23 +0900</pubDate>
        
        <atom:modified>Thu, 24 Nov 2022 20:09:23 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/matterport3d_semantic_segmentation/</guid>
<description>Intro The other day, one of my labmates needed to create segmentation images for Matterport3D. He asked for help, and I got involved. It turned out to be a real struggle: neither of us was used to 3D mesh models.
After several weeks, we completed the code to create semantic segmentation images for Matterport3D.
  How to create Matterport3D segmentation images  Matterport3D provides access to 3D segmentation but gives users no easy way to access it in 2D. The dataset only provides point clouds and meshes labeled with ground truth, so users must color the point clouds and meshes directly to create 2D segmentations.
We therefore wrote code that uses Matterport3DSimulator to place a camera at a given scan_id and viewpoint_id and create a segmentation from the original ply file.
Running our code produces the following image. (I concatenated the obtained images and converted them into a GIF.)
  Matterport3DSimulator takes 36 pictures in total: 12 looking up, 12 level with the horizon, and 12 looking down.</description>
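The 36-view layout described above can be sketched as follows. This is my own minimal illustration, not the post's actual code; the real MatterSim calls appear only in comments, since they require a built Matterport3DSimulator.

```python
# Sketch (assumption, not the post's code): Matterport3DSimulator
# discretizes each panorama into 36 views -- 12 headings, 30 degrees
# apart, at each of 3 elevations (-30 = down, 0 = level, +30 = up).

def discretized_views():
    """Return the 36 (view_index, heading_deg, elevation_deg) tuples."""
    views = []
    for ix in range(36):
        heading = (ix % 12) * 30         # 0, 30, ..., 330
        elevation = (ix // 12 - 1) * 30  # -30, 0, +30
        views.append((ix, heading, elevation))
    return views

# With the real simulator (hypothetical usage), each view would be
# rendered roughly like this:
#   sim.newEpisode([scan_id], [viewpoint_id], [heading], [elevation])
#   state = sim.getState()[0]   # state.rgb holds the rendered frame
```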
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/154d6dd2dab0d8f33c34767bf21caed3.gif" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>Matterport3D</category>
            
          
            
              <category>python</category>
            
          
            
              <category>CV</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
            
              <category>post</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>Peter Anderson</title>
        <link>https://yuiga.dev/blog/en/ja/posts/peter_anderson/</link>
        <pubDate>Fri, 26 Aug 2022 19:57:39 +0900</pubDate>
        
        <atom:modified>Fri, 26 Aug 2022 19:57:39 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/peter_anderson/</guid>
        <description>An incredible researcher. Author of papers you see all the time, such as SPICE: Semantic Propositional Image Caption Evaluation; REVERIE - Remote Embodied Visual Referring Expression in Real Indoor Environments; Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering; and Sim-to-Real Transfer for Vision-and-Language Navigation. He is apparently at Google now.</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        
        
        
        
          
            
              <category>Vision-and-Language</category>
            
          
            
              <category>機械学習</category>
            
          
            
              <category>人物</category>
            
          
            
              <category>post</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering</title>
        <link>https://yuiga.dev/blog/en/ja/posts/mukea_multimodal_knowledge_extraction_and_accumulation_for_knowledge-based_visual_question_answering/</link>
        <pubDate>Wed, 24 Aug 2022 04:13:02 +0900</pubDate>
        
        <atom:modified>Wed, 24 Aug 2022 04:13:02 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/mukea_multimodal_knowledge_extraction_and_accumulation_for_knowledge-based_visual_question_answering/</guid>
        <description>CVPR22. Task: KB-VQA, answering questions that require knowledge not present in the question image. For example, the VQA below cannot be answered without the external knowledge "kawasaki". Novelty: no knowledge graph is constructed. Rather than building a scene graph, the method uses (entity, relation, entity) triplets over an image-derived Head Entity (a region image) and a language-derived Tail Entity (described later)</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/51294e3560f51ee9d6d93c80b996f856.png" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval</title>
        <link>https://yuiga.dev/blog/en/ja/posts/generating_semantically_precise_scene_graphs_from_textual_descriptions_for_improved_image_retrieval/</link>
        <pubDate>Wed, 24 Aug 2022 02:21:50 +0900</pubDate>
        
        <atom:modified>Wed, 24 Aug 2022 02:21:50 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/generating_semantically_precise_scene_graphs_from_textual_descriptions_for_improved_image_retrieval/</guid>
        <description>The Stanford Scene Graph Parser paper (ACL 2015). The gist: automate scene graph generation so it can be used for image retrieval. https://nlp.stanford.edu/software/scenegraph-parser.shtml Pipeline: (1) Generate a semantic graph from a slightly modified version of Universal Dependencies: fixing quantificational modifiers such as "a lot of", resolving pronouns, handling plural nouns →</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>NLP</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
            
              <category>Graph</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] SPICE: Semantic Propositional Image Caption Evaluation</title>
        <link>https://yuiga.dev/blog/en/ja/posts/spice_semantic_propositional_image_caption_evaluation/</link>
        <pubDate>Tue, 16 Aug 2022 20:46:30 +0900</pubDate>
        
        <atom:modified>Tue, 16 Aug 2022 20:46:30 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/spice_semantic_propositional_image_caption_evaluation/</guid>
        <description>The paper on the evaluation metric SPICE (ECCV 2016). Metrics such as BLEU are sensitive to n-gram overlap and cannot be said to evaluate semantics in a true sense. The authors therefore propose SPICE, an evaluation metric based on scene graphs. It has in fact become a common metric for image captioning models. Pipeline: (1) Generate a scene graph from multiple captions; scene graph</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/86b2eefc57b88f16f77dc26d56a26094.png" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>NLP</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
            
              <category>Graph</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>Japanese Caption Datasets</title>
        <link>https://yuiga.dev/blog/en/ja/posts/%E6%97%A5%E6%9C%AC%E8%AA%9E%E3%82%AD%E3%83%A3%E3%83%97%E3%82%B7%E3%83%A7%E3%83%B3%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88/</link>
        <pubDate>Mon, 15 Aug 2022 23:36:52 +0900</pubDate>
        
        <atom:modified>Mon, 15 Aug 2022 23:36:52 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/%E6%97%A5%E6%9C%AC%E8%AA%9E%E3%82%AD%E3%83%A3%E3%83%97%E3%82%B7%E3%83%A7%E3%83%B3%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88/</guid>
        <description>STAIR: adds Japanese captions to MSCOCO; 820,310 captions in total. http://captions.stair.center/ Yuya Yoshikawa, Yutaro Shigeto, and Akikazu Takeuchi, “STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset”, Annual Meeting of the Association for Computational Linguistics (ACL), Short Paper, 2017. YJ Captions 26k Dataset: also adds Japanese captions to MSCOCO (ACL 2016); roughly 1/6 as many captions as STAIR. https://github.com/yahoojapan/YJCaptions Takashi Miyazaki and Nobuyuki Shimizu. 2016. Cross-Lingual Image Caption Generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1780</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        
        
        
        
          
            
              <category>機械学習</category>
            
          
            
              <category>NLP</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
            
              <category>post</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] OTTER: Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation</title>
        <link>https://yuiga.dev/blog/en/ja/posts/otter_data_efficient_language-supervised_zero-shot_recognition_with_optimal_transport_distillation/</link>
        <pubDate>Wed, 10 Aug 2022 18:01:53 +0900</pubDate>
        
        <atom:modified>Wed, 10 Aug 2022 18:01:53 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/otter_data_efficient_language-supervised_zero-shot_recognition_with_optimal_transport_distillation/</guid>
        <description>Motivation: CLIP trains with the identity matrix as its target → if the negatives within a batch are loosely correlated with one another, treating every negative as 0 is not quite right → use the solution of an optimal transport problem as the target instead. The authors propose OTTER (Optimal TransporT distillation for Efficient zero-shot Recognition). Somewhat similar in spirit to Prototypical Contrastive Learning of Unsupervised Representations. Loss: InfoNCE is extended to $$\mathcal{L}_v = -\frac{1}{N} \sum_{i=1}^N \sum_{j=1}^N [\alpha I_{ij} + (1-\alpha) M^{v}_{ij}] \log p_v(\mathbf{z}_i^v, \mathbf{z}_j^t;\tau)$$</description>
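The loss mixes the identity target with a transport plan M. As a rough illustration of the idea only (my own sketch under assumed details, not the paper's code), such a smoothed target can be formed with a few Sinkhorn iterations over a similarity matrix:

```python
import numpy as np

def ot_smoothed_targets(sim, alpha=0.5, reg=0.1, n_iters=50):
    """Sketch of an OTTER-style target: alpha * I + (1 - alpha) * M,
    where M is an entropic optimal transport plan computed from the
    similarity matrix `sim` by Sinkhorn scaling."""
    n = sim.shape[0]
    K = np.exp(sim / reg)              # Gibbs kernel
    u = np.ones(n)
    v = np.ones(n)
    for _ in range(n_iters):           # alternate marginal scaling
        v = 1.0 / (K.T @ u)
        u = 1.0 / (K @ v)
    M = np.diag(u) @ K @ np.diag(v)    # rows sum to 1 after the u-step
    return alpha * np.eye(n) + (1 - alpha) * M
```

Each row of the result is a distribution over the batch, so it can stand in for the one-hot rows of the identity matrix in the cross-entropy.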
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/4dbdcb91a7bb20347b521aeccd47c222.png" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding</title>
        <link>https://yuiga.dev/blog/en/ja/posts/shifting_more_attention_to_visual_backbone_query-modulated_refinement_networks_for_end-to-end_visual_grounding/</link>
        <pubDate>Mon, 25 Jul 2022 12:30:45 +0900</pubDate>
        
        <atom:modified>Mon, 25 Jul 2022 12:30:45 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/shifting_more_attention_to_visual_backbone_query-modulated_refinement_networks_for_end-to-end_visual_grounding/</guid>
        <description>In standard V&amp;L models, the image backbone network does not use language features. Such models (very likely) cannot even solve VQA questions like “How many apples are in the image?”. The authors therefore extend the Swin Transformer so that language features are mixed in along the spatial / channel directions at each stage during inference</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/ac556a8dc0097b4aca251a866ea6d0e4.png" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation</title>
        <link>https://yuiga.dev/blog/en/ja/posts/blip_bootstrapping_language-image_pre-training_for_unified_vision-language_understanding_and_generation/</link>
        <pubDate>Mon, 25 Jul 2022 00:48:05 +0900</pubDate>
        
        <atom:modified>Mon, 25 Jul 2022 00:48:05 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/blip_bootstrapping_language-image_pre-training_for_unified_vision-language_understanding_and_generation/</guid>
        <description>The proposed method consists of two main components: Multimodal mixture of Encoder-Decoder (MED), and Captioning and Filtering (CapFilt). Because the dataset used for CLIP-style pre-training is noisy, a mechanism that automatically selects or discards captions is introduced. Pipeline: train MED on the original noisy dataset; run CapFilt using the pre-trained MED; retrain MED on the dataset produced by CapFilt. MED: Image-Text Contrastive Loss (ITC), image feature</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/782b3acbf1406632a3ae1d16055465e8.png" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation</title>
        <link>https://yuiga.dev/blog/en/ja/posts/think_global_act_local_dual-scale_graph_transformer_for_vision-and-language_navigation/</link>
        <pubDate>Thu, 07 Jul 2022 02:01:33 +0900</pubDate>
        
        <atom:modified>Thu, 07 Jul 2022 02:01:33 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/think_global_act_local_dual-scale_graph_transformer_for_vision-and-language_navigation/</guid>
        <description>VLN-DUET. Overview: the action is decided by integrating both local information and global information from a graph. Once an action is decided, the graph is built dynamically and the shortest path to the destination is searched with the Floyd-Warshall algorithm. Each node keeps the features obtained from its view as an embedding. The action $a^\pi$ is represented by a likelihood over the nodes;</description>
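The shortest-path step is a standard all-pairs computation. A generic Floyd-Warshall sketch (my own illustration, not the DUET code), where the navigation graph is given as a dense weight matrix:

```python
def floyd_warshall(weights):
    """All-pairs shortest path distances.

    weights[i][j] is the edge cost from node i to node j,
    float('inf') where no edge exists."""
    n = len(weights)
    dist = [row[:] for row in weights]  # copy so the input is untouched
    for i in range(n):
        dist[i][i] = 0.0
    for k in range(n):                  # allow node k as an intermediate
        for i in range(n):
            for j in range(n):
                cand = dist[i][k] + dist[k][j]
                if dist[i][j] > cand:
                    dist[i][j] = cand
    return dist
```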
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/84d55d864e6f3ddfa3da97d5ca443f27.png" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] REVERIE - Remote Embodied Visual Referring Expression in Real Indoor Environments</title>
        <link>https://yuiga.dev/blog/en/ja/posts/reverie_-_remote_embodied_visual_referring_expression_in_real_indoor_environments/</link>
        <pubDate>Sun, 26 Jun 2022 17:18:43 +0900</pubDate>
        
        <atom:modified>Sun, 26 Jun 2022 17:18:43 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/reverie_-_remote_embodied_visual_referring_expression_in_real_indoor_environments/</guid>
        <description></description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/c81d7201d9aac75db20a0e2fd3b4d46b.png" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>multi-modal</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>Running Matterport3DSimulator with CUDA 11.1</title>
        <link>https://yuiga.dev/blog/en/ja/posts/matterport3dsimulator%E3%82%92cuda11.1%E3%81%A7%E5%8B%95%E3%81%8B%E3%81%99/</link>
        <pubDate>Sat, 25 Jun 2022 00:33:47 +0900</pubDate>
        
        <atom:modified>Sat, 25 Jun 2022 00:33:47 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/matterport3dsimulator%E3%82%92cuda11.1%E3%81%A7%E5%8B%95%E3%81%8B%E3%81%99/</guid>
        <description>A Dockerfile to run Matterport3DSimulator with CUDA 11.1:
FROM nvcr.io/nvidia/pytorch:19.05-py3
FROM php:7.1.9-apache
FROM nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
RUN rm /etc/apt/sources.list.d/cuda.list
RUN rm /etc/apt/sources.list.d/nvidia-ml.list
RUN apt-key del 7fa2af80
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub
RUN apt-get update
RUN apt-get -y upgrade
RUN apt-get -y install nano wget curl
# ONNX Runtime Training Module for PyTorch
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
ARG TORCH_CUDA_VERSION=cu111
ARG TORCH_VERSION=1.8.1
ARG TORCHVISION_VERSION=0.9.1
# Install and update tools to minimize security vulnerabilities
RUN apt-get update
RUN apt-get install -y software-properties-common wget apt-utils patchelf git libprotobuf-dev protobuf-compiler cmake
RUN unattended-upgrade
RUN</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        
        
        
        
          
            
              <category>機械学習</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
            
              <category>Docker</category>
            
          
            
              <category>Matterport3D</category>
            
          
            
              <category>post</category>
            
          
        
        
        
          
            
          
        
      </item>
      

    
  </channel>
</rss>
