
FlauBERT and FLUE

FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. This repository shares everything: pre-trained models (base and large), the data, the code to use the models, and the code to train them if you need to.

Along with FlauBERT comes FLUE: an evaluation setup for French NLP systems similar to the popular GLUE benchmark. The goal is to enable further reproducible experiments in the future and to share models and progress on the French language.

This repository is still under construction and everything will be available soon.

Table of Contents

1. FlauBERT models
2. Using FlauBERT
    2.1. Using FlauBERT with Hugging Face's Transformers
    2.2. Using FlauBERT with Facebook XLM's library
3. Pre-training FlauBERT
    3.1. Data
    3.2. Training
    3.3. Convert an XLM pre-trained model to Hugging Face's Transformers
4. Fine-tuning FlauBERT on the FLUE benchmark
5. Video presentation
6. Citation

1. FlauBERT models

FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We have released the pretrained weights for the following model sizes.

The pretrained models are available for download here or via Hugging Face's library.

| Model name | Number of layers | Attention Heads | Embedding Dimension | Total Parameters |
| :------: | :---: | :---: | :---: | :---: |
| `flaubert-small-cased` | 6 | 8 | 512 | 54 M |
| `flaubert-base-uncased` | 12 | 12 | 768 | 137 M |
| `flaubert-base-cased` | 12 | 12 | 768 | 138 M |
| `flaubert-large-cased` | 24 | 16 | 1024 | 373 M |

Note: `flaubert-small-cased` is partially trained, so its performance is not guaranteed. Consider using it for debugging purposes only.

We also provide the checkpoints here for the base (cased/uncased) and large (cased) models.
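To sanity-check a downloaded model against the table above, you can count its parameters. Below is a minimal sketch, assuming the `flaubert/flaubert_small_cased` identifier on the Hugging Face hub:

```python
# Minimal sketch (not from the original README): load a released model with
# Hugging Face's Transformers and compare its parameter count to the table above.
from transformers import FlaubertModel

model = FlaubertModel.from_pretrained('flaubert/flaubert_small_cased')
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f} M parameters")  # expected to be roughly 54 M
```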

2. Using FlauBERT

In this section, we describe two ways to obtain sentence embeddings from pretrained FlauBERT models: either via Hugging Face's Transformers library or via Facebook's XLM library. We will integrate FlauBERT into Facebook's fairseq in the near future.

2.1. Using FlauBERT with Hugging Face's Transformers

You can use FlauBERT with Hugging Face's Transformers library as follows.

```python
import torch
from transformers import FlaubertModel, FlaubertTokenizer

# Choose among ['flaubert/flaubert_small_cased', 'flaubert/flaubert_base_uncased',
#               'flaubert/flaubert_base_cased', 'flaubert/flaubert_large_cased']
modelname = 'flaubert/flaubert_base_cased'

# Load pretrained model and tokenizer
flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
# do_lowercase=False if using cased models, True if using uncased ones

sentence = "Le chat mange une pomme."
token_ids = torch.tensor([flaubert_tokenizer.encode(sentence)])

last_layer = flaubert(token_ids)[0]
print(last_layer.shape)
# torch.Size([1, 8, 768])  -> (batch size x number of tokens x embedding dimension)

# The BERT [CLS] token corresponds to the first hidden state of the last layer
cls_embedding = last_layer[:, 0, :]
```
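The snippet above uses the `[CLS]` vector as the sentence representation. As an alternative (a sketch, not part of the original snippet), you can also mean-pool the token vectors of the last layer:

```python
# Sketch (assumption, not from the original snippet): mean-pool the last-layer
# token vectors to obtain a fixed-size sentence embedding.
sentence_embedding = last_layer.mean(dim=1)  # (batch size, embedding dimension)
print(sentence_embedding.shape)              # torch.Size([1, 768]) for flaubert_base_cased
```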

Note: if your `transformers` version is <= 2.10.0, `modelname` should take one of the following values: `['flaubert-small-cased', 'flaubert-base-uncased', 'flaubert-base-cased', 'flaubert-large-cased']`.
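If you need to support both old and new `transformers` versions, a minimal sketch (an assumption, not from the original README) that picks the identifier format at runtime could look like this:

```python
# Sketch: choose the model identifier based on the installed transformers version.
import transformers
from packaging import version

if version.parse(transformers.__version__) <= version.parse("2.10.0"):
    modelname = 'flaubert-base-cased'           # old-style identifier
else:
    modelname = 'flaubert/flaubert_base_cased'  # namespaced identifier
```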

5. Video presentation

You can watch a 7-minute video presentation of FlauBERT here.

6. Citation

If you use FlauBERT or the FLUE Benchmark for your scientific publication, or if you find the resources in this repository useful, please cite one of the following papers:

LREC paper

```bibtex
@InProceedings{le2020flaubert,
  author    = {Le, Hang  and  Vial, Lo\"{i}c  and  Frej, Jibril  and  Segonne, Vincent  and  Coavoux, Maximin  and  Lecouteux, Benjamin  and  Allauzen, Alexandre  and  Crabb\'{e}, Beno\^{i}t  and  Besacier, Laurent  and  Schwab, Didier},
  title     = {FlauBERT: Unsupervised Language Model Pre-training for French},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
  month     = {May},
  year      = {2020},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {2479--2490},
  url       = {https://www.aclweb.org/anthology/2020.lrec-1.302}
}
```

TALN paper

```bibtex
@inproceedings{le2020flaubert,
  title         = {FlauBERT: des mod{\`e}les de langue contextualis{\'e}s pr{\'e}-entra{\^\i}n{\'e}s pour le fran{\c{c}}ais},
  author        = {Le, Hang and Vial, Lo{\"\i}c and Frej, Jibril and Segonne, Vincent and Coavoux, Maximin and Lecouteux, Benjamin and Allauzen, Alexandre and Crabb{\'e}, Beno{\^\i}t and Besacier, Laurent and Schwab, Didier},
  booktitle     = {Actes de la 6e conf{\'e}rence conjointe Journ{\'e}es d'{\'E}tudes sur la Parole (JEP, 31e {\'e}dition), Traitement Automatique des Langues Naturelles (TALN, 27e {\'e}dition), Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (R{\'E}CITAL, 22e {\'e}dition). Volume 2: Traitement Automatique des Langues Naturelles},
  pages         = {268--278},
  year          = {2020},
  organization  = {ATALA}
}
```
