<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" 
  xmlns:content="http://purl.org/rss/1.0/modules/content/" 
  xmlns:dc="http://purl.org/dc/elements/1.1/" 
  xmlns:atom="http://www.w3.org/2005/Atom" 
  xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" 
  xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Vision-and-Language on 行李の底に収めたり[YuWd]</title>
    <link>https://yuiga.dev/blog/en/tags/vision-and-language/</link>
    <description>Recent content in Vision-and-Language on 行李の底に収めたり[YuWd]</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <copyright>©2026, All Rights Reserved</copyright>
    <lastBuildDate>Thu, 24 Nov 2022 20:09:23 +0900</lastBuildDate>
    
        <atom:link href="https://yuiga.dev/blog/en/tags/vision-and-language/index.xml" rel="self" type="application/rss+xml" />
    

      
      <item>
        <title>How to create Matterport3D segmentation images?</title>
        <link>https://yuiga.dev/blog/en/ja/posts/matterport3d_semantic_segmentation/</link>
        <pubDate>Thu, 24 Nov 2022 20:09:23 +0900</pubDate>
        
        <atom:modified>Thu, 24 Nov 2022 20:09:23 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/matterport3d_semantic_segmentation/</guid>
<description>Intro The other day, one of my labmates needed to create segmentation images for Matterport3D. He asked for help, and I got involved. It turned out to be a real struggle: neither of us was used to 3D mesh models.
After several weeks, we completed the code to create semantic segmentation images for Matterport3D.
  How to create Matterport3D segmentation images  Matterport3D provides access to 3D segmentation but gives users no easy way to access it in 2D. The dataset only provides point clouds and meshes labeled with ground truth, so users must color the point clouds and meshes directly to create 2D segmentations.
We therefore wrote code that uses Matterport3DSimulator to place a camera at a given scan_id and viewpoint_id and create a segmentation from the original ply file.
Running our code produces the following image. (I concatenated the obtained images and converted them into a GIF.)
  Matterport3DSimulator takes 36 pictures in total: 12 looking up, 12 level with the horizon, and 12 looking down.</description>
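The 36-view layout described above can be sketched as follows. This is my own minimal illustration, not the post's actual code; the real MatterSim calls appear only in comments, since they require a built Matterport3DSimulator.

```python
# Sketch (assumption, not the post's code): Matterport3DSimulator
# discretizes each panorama into 36 views -- 12 headings, 30 degrees
# apart, at each of 3 elevations (-30 = down, 0 = level, +30 = up).

def discretized_views():
    """Return the 36 (view_index, heading_deg, elevation_deg) tuples."""
    views = []
    for ix in range(36):
        heading = (ix % 12) * 30         # 0, 30, ..., 330
        elevation = (ix // 12 - 1) * 30  # -30, 0, +30
        views.append((ix, heading, elevation))
    return views

# With the real simulator (hypothetical usage), each view would be
# rendered roughly like this:
#   sim.newEpisode([scan_id], [viewpoint_id], [heading], [elevation])
#   state = sim.getState()[0]   # state.rgb holds the rendered frame
```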
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/154d6dd2dab0d8f33c34767bf21caed3.gif" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>Matterport3D</category>
            
          
            
              <category>python</category>
            
          
            
              <category>CV</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
            
              <category>post</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>Peter Anderson</title>
        <link>https://yuiga.dev/blog/en/ja/posts/peter_anderson/</link>
        <pubDate>Fri, 26 Aug 2022 19:57:39 +0900</pubDate>
        
        <atom:modified>Fri, 26 Aug 2022 19:57:39 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/peter_anderson/</guid>
        <description>An incredible researcher. Author of papers you see all the time, such as SPICE: Semantic Propositional Image Caption Evaluation; REVERIE - Remote Embodied Visual Referring Expression in Real Indoor Environments; Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering; and Sim-to-Real Transfer for Vision-and-Language Navigation. He is apparently at Google now.</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        
        
        
        
          
            
              <category>Vision-and-Language</category>
            
          
            
              <category>機械学習</category>
            
          
            
              <category>人物</category>
            
          
            
              <category>post</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-based Visual Question Answering</title>
        <link>https://yuiga.dev/blog/en/ja/posts/mukea_multimodal_knowledge_extraction_and_accumulation_for_knowledge-based_visual_question_answering/</link>
        <pubDate>Wed, 24 Aug 2022 04:13:02 +0900</pubDate>
        
        <atom:modified>Wed, 24 Aug 2022 04:13:02 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/mukea_multimodal_knowledge_extraction_and_accumulation_for_knowledge-based_visual_question_answering/</guid>
        <description>CVPR22. Task: KB-VQA, answering questions that require knowledge not present in the question image. For example, the VQA below cannot be answered without the external knowledge "kawasaki". Novelty: no knowledge graph is constructed. Rather than building a scene graph, the method uses (entity, relation, entity) triplets over an image-derived Head Entity (a region image) and a language-derived Tail Entity (described later)</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/51294e3560f51ee9d6d93c80b996f856.png" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval</title>
        <link>https://yuiga.dev/blog/en/ja/posts/generating_semantically_precise_scene_graphs_from_textual_descriptions_for_improved_image_retrieval/</link>
        <pubDate>Wed, 24 Aug 2022 02:21:50 +0900</pubDate>
        
        <atom:modified>Wed, 24 Aug 2022 02:21:50 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/generating_semantically_precise_scene_graphs_from_textual_descriptions_for_improved_image_retrieval/</guid>
        <description>The Stanford Scene Graph Parser paper (ACL 2015). The gist: automate scene graph generation so it can be used for image retrieval. https://nlp.stanford.edu/software/scenegraph-parser.shtml Pipeline: (1) Generate a semantic graph from a slightly modified version of Universal Dependencies: fixing quantificational modifiers such as "a lot of", resolving pronouns, handling plural nouns →</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>NLP</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
            
              <category>Graph</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] SPICE: Semantic Propositional Image Caption Evaluation</title>
        <link>https://yuiga.dev/blog/en/ja/posts/spice_semantic_propositional_image_caption_evaluation/</link>
        <pubDate>Tue, 16 Aug 2022 20:46:30 +0900</pubDate>
        
        <atom:modified>Tue, 16 Aug 2022 20:46:30 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/spice_semantic_propositional_image_caption_evaluation/</guid>
        <description>The paper on the evaluation metric SPICE (ECCV 2016). Metrics such as BLEU are sensitive to n-gram overlap and cannot be said to evaluate semantics in a true sense. The authors therefore propose SPICE, an evaluation metric based on scene graphs. It has in fact become a common metric for image captioning models. Pipeline: (1) Generate a scene graph from multiple captions; scene graph</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/86b2eefc57b88f16f77dc26d56a26094.png" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>NLP</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
            
              <category>Graph</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>Japanese Caption Datasets</title>
        <link>https://yuiga.dev/blog/en/ja/posts/%E6%97%A5%E6%9C%AC%E8%AA%9E%E3%82%AD%E3%83%A3%E3%83%97%E3%82%B7%E3%83%A7%E3%83%B3%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88/</link>
        <pubDate>Mon, 15 Aug 2022 23:36:52 +0900</pubDate>
        
        <atom:modified>Mon, 15 Aug 2022 23:36:52 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/%E6%97%A5%E6%9C%AC%E8%AA%9E%E3%82%AD%E3%83%A3%E3%83%97%E3%82%B7%E3%83%A7%E3%83%B3%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88/</guid>
        <description>STAIR: adds Japanese captions to MSCOCO; 820,310 captions in total. http://captions.stair.center/ Yuya Yoshikawa, Yutaro Shigeto, and Akikazu Takeuchi, “STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset”, Annual Meeting of the Association for Computational Linguistics (ACL), Short Paper, 2017. YJ Captions 26k Dataset: also adds Japanese captions to MSCOCO (ACL 2016); roughly 1/6 as many captions as STAIR. https://github.com/yahoojapan/YJCaptions Takashi Miyazaki and Nobuyuki Shimizu. 2016. Cross-Lingual Image Caption Generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1780</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        
        
        
        
          
            
              <category>機械学習</category>
            
          
            
              <category>NLP</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
            
              <category>post</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] OTTER: Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation</title>
        <link>https://yuiga.dev/blog/en/ja/posts/otter_data_efficient_language-supervised_zero-shot_recognition_with_optimal_transport_distillation/</link>
        <pubDate>Wed, 10 Aug 2022 18:01:53 +0900</pubDate>
        
        <atom:modified>Wed, 10 Aug 2022 18:01:53 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/otter_data_efficient_language-supervised_zero-shot_recognition_with_optimal_transport_distillation/</guid>
        <description>Motivation: CLIP trains with the identity matrix as its target → if the negatives within a batch are loosely correlated with one another, treating every negative as 0 is not quite right → use the solution of an optimal transport problem as the target instead. The authors propose OTTER (Optimal TransporT distillation for Efficient zero-shot Recognition). Somewhat similar in spirit to Prototypical Contrastive Learning of Unsupervised Representations. Loss: InfoNCE is extended to $$\mathcal{L}_v = -\frac{1}{N} \sum_{i=1}^N \sum_{j=1}^N [\alpha I_{ij} + (1-\alpha) M^{v}_{ij}] \log p_v(\mathbf{z}_i^v, \mathbf{z}_j^t;\tau)$$</description>
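The loss mixes the identity target with a transport plan M. As a rough illustration of the idea only (my own sketch under assumed details, not the paper's code), such a smoothed target can be formed with a few Sinkhorn iterations over a similarity matrix:

```python
import numpy as np

def ot_smoothed_targets(sim, alpha=0.5, reg=0.1, n_iters=50):
    """Sketch of an OTTER-style target: alpha * I + (1 - alpha) * M,
    where M is an entropic optimal transport plan computed from the
    similarity matrix `sim` by Sinkhorn scaling."""
    n = sim.shape[0]
    K = np.exp(sim / reg)              # Gibbs kernel
    u = np.ones(n)
    v = np.ones(n)
    for _ in range(n_iters):           # alternate marginal scaling
        v = 1.0 / (K.T @ u)
        u = 1.0 / (K @ v)
    M = np.diag(u) @ K @ np.diag(v)    # rows sum to 1 after the u-step
    return alpha * np.eye(n) + (1 - alpha) * M
```

Each row of the result is a distribution over the batch, so it can stand in for the one-hot rows of the identity matrix in the cross-entropy.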
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/4dbdcb91a7bb20347b521aeccd47c222.png" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding</title>
        <link>https://yuiga.dev/blog/en/ja/posts/shifting_more_attention_to_visual_backbone_query-modulated_refinement_networks_for_end-to-end_visual_grounding/</link>
        <pubDate>Mon, 25 Jul 2022 12:30:45 +0900</pubDate>
        
        <atom:modified>Mon, 25 Jul 2022 12:30:45 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/shifting_more_attention_to_visual_backbone_query-modulated_refinement_networks_for_end-to-end_visual_grounding/</guid>
        <description>In standard V&amp;L models, the image backbone network does not use language features. Such models (very likely) cannot even solve VQA questions like “How many apples are in the image?”. The authors therefore extend the Swin Transformer so that language features are mixed in along the spatial / channel directions at each stage during inference</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/ac556a8dc0097b4aca251a866ea6d0e4.png" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation</title>
        <link>https://yuiga.dev/blog/en/ja/posts/blip_bootstrapping_language-image_pre-training_for_unified_vision-language_understanding_and_generation/</link>
        <pubDate>Mon, 25 Jul 2022 00:48:05 +0900</pubDate>
        
        <atom:modified>Mon, 25 Jul 2022 00:48:05 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/blip_bootstrapping_language-image_pre-training_for_unified_vision-language_understanding_and_generation/</guid>
        <description>The proposed method consists of two main components: Multimodal mixture of Encoder-Decoder (MED), and Captioning and Filtering (CapFilt). Because the dataset used for CLIP-style pre-training is noisy, a mechanism that automatically selects or discards captions is introduced. Pipeline: train MED on the original noisy dataset; run CapFilt using the pre-trained MED; retrain MED on the dataset produced by CapFilt. MED: Image-Text Contrastive Loss (ITC), image feature</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/782b3acbf1406632a3ae1d16055465e8.png" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation</title>
        <link>https://yuiga.dev/blog/en/ja/posts/think_global_act_local_dual-scale_graph_transformer_for_vision-and-language_navigation/</link>
        <pubDate>Thu, 07 Jul 2022 02:01:33 +0900</pubDate>
        
        <atom:modified>Thu, 07 Jul 2022 02:01:33 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/think_global_act_local_dual-scale_graph_transformer_for_vision-and-language_navigation/</guid>
        <description>VLN-DUET. Overview: the action is decided by integrating both local information and global information from a graph. Once an action is decided, the graph is built dynamically and the shortest path to the destination is searched with the Floyd-Warshall algorithm. Each node keeps the features obtained from its view as an embedding. The action $a^\pi$ is represented by a likelihood over the nodes;</description>
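The shortest-path step is a standard all-pairs computation. A generic Floyd-Warshall sketch (my own illustration, not the DUET code), where the navigation graph is given as a dense weight matrix:

```python
def floyd_warshall(weights):
    """All-pairs shortest path distances.

    weights[i][j] is the edge cost from node i to node j,
    float('inf') where no edge exists."""
    n = len(weights)
    dist = [row[:] for row in weights]  # copy so the input is untouched
    for i in range(n):
        dist[i][i] = 0.0
    for k in range(n):                  # allow node k as an intermediate
        for i in range(n):
            for j in range(n):
                cand = dist[i][k] + dist[k][j]
                if dist[i][j] > cand:
                    dist[i][j] = cand
    return dist
```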
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/84d55d864e6f3ddfa3da97d5ca443f27.png" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>[Paper Notes] REVERIE - Remote Embodied Visual Referring Expression in Real Indoor Environments</title>
        <link>https://yuiga.dev/blog/en/ja/posts/reverie_-_remote_embodied_visual_referring_expression_in_real_indoor_environments/</link>
        <pubDate>Sun, 26 Jun 2022 17:18:43 +0900</pubDate>
        
        <atom:modified>Sun, 26 Jun 2022 17:18:43 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/reverie_-_remote_embodied_visual_referring_expression_in_real_indoor_environments/</guid>
        <description></description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        <media:content url="https://gyazo.com/c81d7201d9aac75db20a0e2fd3b4d46b.png" medium="image"><media:title type="html">featured image</media:title></media:content>
        
        
        
          
            
              <category>論文</category>
            
          
            
              <category>multi-modal</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
        
        
        
          
            
          
        
      </item>
      
      <item>
        <title>Running Matterport3DSimulator with CUDA 11.1</title>
        <link>https://yuiga.dev/blog/en/ja/posts/matterport3dsimulator%E3%82%92cuda11.1%E3%81%A7%E5%8B%95%E3%81%8B%E3%81%99/</link>
        <pubDate>Sat, 25 Jun 2022 00:33:47 +0900</pubDate>
        
        <atom:modified>Sat, 25 Jun 2022 00:33:47 +0900</atom:modified>
        <guid>https://yuiga.dev/blog/en/ja/posts/matterport3dsimulator%E3%82%92cuda11.1%E3%81%A7%E5%8B%95%E3%81%8B%E3%81%99/</guid>
        <description>A Dockerfile to run Matterport3DSimulator with CUDA 11.1:
FROM nvcr.io/nvidia/pytorch:19.05-py3
FROM php:7.1.9-apache
FROM nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
RUN rm /etc/apt/sources.list.d/cuda.list
RUN rm /etc/apt/sources.list.d/nvidia-ml.list
RUN apt-key del 7fa2af80
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub
RUN apt-get update
RUN apt-get -y upgrade
RUN apt-get -y install nano wget curl
# ONNX Runtime Training Module for PyTorch
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License.
ARG TORCH_CUDA_VERSION=cu111
ARG TORCH_VERSION=1.8.1
ARG TORCHVISION_VERSION=0.9.1
# Install and update tools to minimize security vulnerabilities
RUN apt-get update
RUN apt-get install -y software-properties-common wget apt-utils patchelf git libprotobuf-dev protobuf-compiler cmake
RUN unattended-upgrade
RUN</description>
        
        <dc:creator>YuWd (Yuiga Wada)</dc:creator>
        
        
        
        
          
            
              <category>機械学習</category>
            
          
            
              <category>Vision-and-Language</category>
            
          
            
              <category>Docker</category>
            
          
            
              <category>Matterport3D</category>
            
          
            
              <category>post</category>
            
          
        
        
        
          
            
          
        
      </item>
      

    
  </channel>
</rss>
