AI Supremacy


Salesforce AI Presents BLIP

Bootstrapping Language-Image Pre-training for unified Vision-Language understanding/generation.

Michael Spencer
Mar 7

Read our Archives of free articles.

At AiSupremacy I try to cover a broad range of research in AI, including work from Google, DeepMind, OpenAI, and Microsoft, among others. Today it's time to cover Salesforce AI.

What is BLIP?

TL;DR: BLIP is a new pre-training framework for unified vision-language understanding and generation, which achieves state-of-the-art results on a wide range of vision-language tasks.

Junnan Li's blog feed on Salesforce AI is helpful for quickly scanning what Salesforce AI has been up to.

Salesforce AI Research (@SFResearch) announced BLIP on Twitter on February 24th, 2022:

"Meet BLIP: Bootstrapping Language-Image Pre-training for unified Vision-Language understanding/generation. New model architecture + Dataset bootstrapping = SoTA results on a wider range of V+L tasks than other models!"

Check out Salesforce AI Papers Here

Salesforce AI Research Proposes 'BLIP': Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

  • Vision-language pre-training has been widely adopted to enable AI agents to understand the world and communicate with humans.

  • This approach trains a model on image-text data so that it learns to handle both visual and textual information. The model is pre-trained before being fine-tuned; skipping this step would reduce performance, because the model would then have to be trained from scratch on each subsequent task.

  • Vision-language pre-training has proven to improve performance on downstream vision-language tasks like image-text retrieval, image captioning, and visual question answering.

  • But the majority of available pre-trained models aren't adaptable enough to cover a wide range of vision-language tasks. Encoder-based models are difficult to apply directly to text generation tasks, whereas encoder-decoder models have not yet been successfully adopted for image-text retrieval. Furthermore, most models are pre-trained on image and alt-text pairs automatically collected from the web, and this alt-text often fails to describe the visual content of the images accurately.


SOLUTION - BLIP: Bootstrapping Language-Image Pre-training

To overcome these issues, the Salesforce team came up with BLIP: Bootstrapping Language-Image Pre-training for unified vision-language understanding and generation. BLIP has a novel model architecture that allows for a broader range of downstream tasks than previous methods, and it also includes a new dataset bootstrapping strategy for learning from noisy web data.

To deal with the noisy captions in web image-text pairs, the team uses two modules (a short code sketch follows this list):

  1. Captioner: an image-grounded text decoder that generates synthetic captions for web images, which serve as extra training examples.

  2. Filter: an image-grounded text encoder that removes noisy captions that do not match their images.
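Here is a minimal Python sketch of this dataset bootstrapping step (called CapFilt in the paper), assuming hypothetical captioner and filter_model callables; it illustrates the idea rather than the actual Salesforce implementation.

```python
# Hedged sketch of BLIP-style dataset bootstrapping (CapFilt).
# `captioner` and `filter_model` are hypothetical callables, not the real BLIP modules.

def bootstrap_dataset(web_pairs, captioner, filter_model, threshold=0.5):
    """Clean noisy web image-text pairs.

    web_pairs:    iterable of (image, web_caption) tuples scraped from the web.
    captioner:    image-grounded text decoder that writes a synthetic caption for an image.
    filter_model: image-grounded text encoder that returns an image-text match probability.
    Returns a list of (image, caption) pairs to pre-train on.
    """
    cleaned = []
    for image, web_caption in web_pairs:
        # 1. Captioner: generate a synthetic caption as an extra training example.
        synthetic_caption = captioner(image)

        # 2. Filter: keep only the captions judged to match the image.
        for caption in (web_caption, synthetic_caption):
            if filter_model(image, caption) > threshold:
                cleaned.append((image, caption))
    return cleaned
```

In the paper, the filtered web captions and synthetic captions, together with human-annotated pairs, form the new pre-training dataset.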

BLIP is based on the multimodal mixture of encoder-decoder (MED), a multi-task model that can operate in one of three functionalities (a conceptual sketch follows this list):

  1. Unimodal encoders: used to encode images and text separately. The image encoder is a vision transformer and the text encoder is the same as BERT's. The unimodal encoders are activated by the Image-Text Contrastive (ITC) loss, which aims to align the feature spaces of the visual and text transformers by encouraging positive image-text pairs to have similar representations, in contrast to negative pairs.

  2. Image-grounded text encoder: injects visual information by inserting a cross-attention layer between the self-attention layer and the feed-forward network in each transformer block of the text encoder. It is activated by the Image-Text Matching (ITM) loss, which asks the model to predict whether an image-text pair is positive (matched) or negative (unmatched) based on their multimodal feature.

  3. Image-grounded text decoder: replaces the bi-directional self-attention layers of the text encoder with causal self-attention layers. The decoder is activated by the Language Modeling (LM) loss, which aims to generate textual descriptions conditioned on the images.
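To make the three functionalities concrete, here is a conceptual PyTorch sketch of how a single MED-style model could switch between them. The module and argument names (vision_transformer, text_encoder, text_decoder, encoder_hidden_states) are illustrative placeholders, not the identifiers used in the salesforce/BLIP codebase.

```python
import torch.nn as nn

class MED(nn.Module):
    """Multimodal mixture of encoder-decoder: one model, three modes.
    The sub-modules are assumed placeholders with the call signatures used below."""

    def __init__(self, vision_transformer, text_encoder, text_decoder):
        super().__init__()
        self.visual_encoder = vision_transformer  # ViT image encoder
        self.text_encoder = text_encoder          # BERT-style text encoder (with cross-attention)
        self.text_decoder = text_decoder          # causal text decoder (with cross-attention)

    def forward(self, image, text_ids, mode):
        image_embeds = self.visual_encoder(image)
        if mode == "unimodal":
            # Encode image and text separately; used by the ITC loss.
            return image_embeds, self.text_encoder(text_ids)
        if mode == "multimodal_encode":
            # Image-grounded text encoder: text cross-attends to image features; used by the ITM loss.
            return self.text_encoder(text_ids, encoder_hidden_states=image_embeds)
        if mode == "generate":
            # Image-grounded text decoder with causal self-attention; used by the LM loss.
            return self.text_decoder(text_ids, encoder_hidden_states=image_embeds)
        raise ValueError(f"unknown mode: {mode}")
```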

Salesforce Accelerating Vision-Language Task Performance

The team also notes that stochastic decoding (nucleus sampling) works better than beam search for generating the synthetic captions, because the sampled captions are more diverse.
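For illustration, here is a single top-p (nucleus) sampling step of the kind used for stochastic decoding; this is a generic sketch, not code from the BLIP repository.

```python
import torch

def sample_top_p(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample the next token id from a 1-D tensor of vocabulary logits,
    restricted to the smallest set of tokens whose cumulative probability
    exceeds p (nucleus sampling)."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the minimal prefix of tokens whose cumulative mass reaches p.
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(p)).item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize
    choice = torch.multinomial(kept, num_samples=1)
    return int(sorted_ids[choice].item())
```

Repeating this step token by token yields varied captions for the same image, whereas beam search tends to return the most likely, and therefore most generic, caption.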

The results show that BLIP achieves state-of-the-art performance on seven vision-language tasks, including image-text retrieval, image captioning, visual question answering, visual reasoning, visual dialog, zero-shot text-video retrieval, and zero-shot video question answering.

Paper: https://arxiv.org/abs/2201.12086

Project: https://huggingface.co/spaces/Salesforce/BLIP

Github: https://github.com/salesforce/BLIP

References:

  • https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/

Read it on Salesforce AI Blog

How Salesforce AI Did It

Our Solution: Flip the Script with BLIP

To address these limitations, the Salesforce AI team proposed BLIP: Bootstrapping Language-Image Pre-training for unified vision-language understanding and generation. BLIP introduces:

  • a new model architecture that enables a wider range of downstream tasks than existing methods, and

  • a new dataset bootstrapping method for learning from noisy web data.

BLIP achieves state-of-the-art performance on seven vision-language tasks, including:

  • image-text retrieval

  • image captioning

  • visual question answering

  • visual reasoning

  • visual dialog

  • zero-shot text-video retrieval

  • zero-shot video question answering.

While Salesforce AI is not usually considered a powerhouse in AI research like FAIR, Microsoft Research, or Google AI, this is pretty impressive work.

Synced and MarkTechPost are interesting places if you really want to keep up to date on all of these papers from major AI research teams and corporations. I like how Synced often covers cutting-edge research from top universities in collaboration with Big Tech.

Computer vision remains a hotbed of A.I. research, as Salesforce demonstrates here.

Their blog is also easier to read and goes into more depth, with visual graphics.

Deep Dive: How BLIP Works

A unified model for vision-language understanding and generation

In order to pre-train a unified vision-language model with both understanding and generation capabilities, BLIP introduces the multimodal mixture of encoder-decoder (MED), a multi-task model which can operate in one of three functionalities (a block-level sketch of the second one follows this list):

  1. Unimodal encoders, which separately encode image and text. The image encoder is a vision transformer. The text encoder is the same as BERT. A [CLS] token is appended to the beginning of the text input to summarize the sentence.

  2. Image-grounded text encoder, which injects visual information by inserting a cross-attention layer between the self-attention layer and the feed forward network for each transformer block of the text encoder. A task-specific [Encode] token is appended to the text, and the output embedding of [Encode] is used as the multimodal representation of the image-text pair.

  3. Image-grounded text decoder, which replaces the bi-directional self-attention layers in the text encoder with causal self-attention layers. A special [Decode] token is used to signal the beginning of a sequence.
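As a rough illustration of point 2, here is what one transformer block of the image-grounded text encoder could look like, with a cross-attention layer inserted between the self-attention layer and the feed-forward network. Dimensions and layer details are illustrative, not the exact BLIP layers.

```python
import torch.nn as nn

class ImageGroundedTextBlock(nn.Module):
    """One text-encoder block with cross-attention to image features (illustrative)."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, text, image_feats):
        # Bi-directional self-attention over the text tokens.
        t = self.norm1(text)
        x = text + self.self_attn(t, t, t)[0]
        # Cross-attention: text queries attend to image patch features.
        x = x + self.cross_attn(self.norm2(x), image_feats, image_feats)[0]
        # Position-wise feed-forward network.
        return x + self.ffn(self.norm3(x))
```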

BLIP jointly optimizes three objectives during pre-training, with two understanding-based objectives (ITC, ITM) and one generation-based objective (LM); a sketch that sums the three losses follows the list:

  • Image-Text Contrastive Loss (ITC) activates the unimodal encoder. It aims to align the feature space of the visual transformer and the text transformer by encouraging positive image-text pairs to have similar representations in contrast to the negative pairs.

  • Image-Text Matching Loss (ITM) activates the image-grounded text encoder. ITM is a binary classification task, where the model is asked to predict whether an image-text pair is positive (matched) or negative (unmatched) given their multimodal feature.

  • Language Modeling Loss (LM) activates the image-grounded text decoder, which aims to generate textual descriptions conditioned on the images.
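Assuming the three modes produce the usual tensors (unimodal image and text features, ITM logits, and decoder logits), the joint objective is simply the sum of the three losses. This is a simplified sketch; padding masks and other training details are omitted, and it is not the actual BLIP training code.

```python
import torch
import torch.nn.functional as F

def pretraining_losses(image_feats, text_feats, itm_logits, itm_labels,
                       lm_logits, token_ids, temperature: float = 0.07):
    """Sum BLIP's three objectives; the inputs are assumed to come from the
    three MED modes, with illustrative shapes noted in the comments."""
    # ITC: symmetric contrastive loss over the image-text similarity matrix.
    sim = image_feats @ text_feats.t() / temperature                 # [B, B]
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_itc = (F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)) / 2

    # ITM: binary matched / unmatched classification of image-text pairs.
    loss_itm = F.cross_entropy(itm_logits, itm_labels)               # itm_logits: [N, 2]

    # LM: autoregressive captioning loss, predicting each token from its prefix.
    loss_lm = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),           # [B*(T-1), V]
        token_ids[:, 1:].reshape(-1))                                # next-token targets

    return loss_itc + loss_itm + loss_lm
```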

Salesforce AI in the Future

Of particular interest is Salesforce Einstein AI. Salesforce describes this as the ability to empower everyone with built-in intelligence to engage with empathy, increase productivity, and scale customer experiences.

Of course, for Salesforce, a CRM company, this implies better sales and a more intelligent sales funnel.

  • Get deep insights from your customers based on past interactions.

  • Use these insights to strengthen relationships, prioritize leads, cases, and campaigns to drive your business forward.

The company's earnings in early 2022 were positive. It posted adjusted earnings of 84 cents per share on revenue of $7.33 billion; analysts had expected a profit of 74 cents per share on revenue of $7.24 billion, according to Refinitiv. The Slack deal closed in July. San Francisco-based Salesforce said revenue climbed 26% to $7.33 billion, including $312 million from Slack.

I am pretty optimistic about this company's ability to contribute to AI research now and in the future.

At AiSupremacy there are so many topics I want to cover, and I also want to dedicate some time to the latest research papers. To help with my coverage I created a newsletter of bite-sized nuggets called Artificial Intelligence Survey.

Artificial Intelligence Survey

Bite size curation of links to A.I. News, funding and trending topics from around the web.
By Michael Spencer

I'm really enjoying covering A.I. on Substack. If you enjoy the content, like it and leave me a comment sometimes to keep up my motivation! I sometimes share my articles on Reddit, Hacker News or LinkedIn.

I have no partnerships, corporate sponsors or real income to support this project so I’m entirely dependent on community support. Substack is literally my “content basic income”.

NOTE FROM THE AUTHOR

I cannot continue to write without tips, patronage and community support from you, my readers and audience. I want to keep my articles free for the majority of my readers.

Do you enjoy A.I. articles at the intersection of breaking news? Then help me continue to write on the subject. I cannot continue to write without support. Grateful for all tips, patronage and community contributions.

Join 44 other paying subscribers

So by subscribing you are essentially helping fund a network of Newsletters whose aim is to inspire and inform. This is my only job and stream of income.

See My Writing Feed

AiSupremacy is the fastest-growing Substack newsletter in AI at the intersection of breaking news. It was ranked #1 in Machine Learning as of January 22nd, 2022.

I’m having a great time trying to cover artificial intelligence papers, news and Op-eds.

Thanks for reading!
