15th WORKSHOP ON BUILDING AND USING COMPARABLE CORPORA (BUCC)
Co-located with LREC 2022 (Marseille)
Saturday, June 25, 2022
Paper submission deadline (extended): April 20, 2022
Workshop website: https://comparable.limsi.fr/bucc2022/
LREC website: https://lrec2022.lrec-conf.org/en/
**************************************************************
MOTIVATION
In the language engineering and the linguistics communities, research in
comparable corpora has been motivated by two main reasons. In language
engineering, on the one hand, it is primarily motivated by the need to
use comparable corpora as training data for statistical NLP applications
such as statistical and neural machine translation or cross-lingual
information retrieval. In linguistics, on the other hand, comparable
corpora are of interest because they enable cross-language discoveries
and comparisons. It is generally accepted in both communities that
comparable corpora consist of documents that are comparable in content
and form in various degrees and dimensions across several languages,
dialects, or varieties. Parallel corpora are at one end of this
spectrum, unrelated corpora at the other.
TOPICS
We solicit contributions on all topics related to comparable (and
parallel) corpora, including but not limited to the following:
Building Comparable Corpora:
* Automatic and semi-automatic methods
* Methods to mine parallel and non-parallel corpora from the web
* Tools and criteria to evaluate the comparability of corpora
* Parallel vs non-parallel corpora, monolingual corpora
* Rare and minority languages, across language families
* Multi-media/multi-modal comparable corpora
Applications of comparable corpora:
* Human translation
* Language learning
* Cross-language information retrieval & document categorization
* Bilingual and multilingual projections
* (Unsupervised) machine translation
* Writing assistance
* Machine learning techniques using comparable corpora
Mining from Comparable Corpora:
* Cross-language distributional semantics and pre-trained multilingual
transformer models
* Creation of bilingual and multilingual embeddings from comparable corpora
* Methods to derive parallel from non-parallel corpora (e.g. to provide
for low-resource languages in neural machine translation)
* Extraction of bilingual and multilingual translations of single words,
multi-word expressions, proper names, named entities, sentences, and
paraphrases from comparable corpora, etc.
* Induction of morphological, grammatical, and translation rules from
comparable corpora
* Induction of multilingual word classes from comparable corpora
Comparable Corpora in the Humanities:
* Comparing linguistic phenomena across languages in contrastive linguistics
* Analyzing properties of translated language in translation studies
* Studying language change over time in diachronic linguistics
* Assigning texts to authors via authors' corpora in forensic linguistics
* Comparing rhetorical features in discourse analysis
* Studying cultural differences in sociolinguistics
* Analyzing language universals in typological research
IMPORTANT DATES
April 20, 2022: Paper submission deadline (extended)
May 3, 2022: Notification of acceptance
May 23, 2022: Camera-ready final papers
June 25, 2022: Workshop date
For updates see the workshop website at
https://comparable.limsi.fr/bucc2022/
PRACTICAL INFORMATION
Registration for the workshop will be via the main conference website at
https://lrec2022.lrec-conf.org/en/
SUBMISSION GUIDELINES
Please follow the style sheet and templates provided for the main
conference at
https://lrec2022.lrec-conf.org/en/submission2022/authors-kit/
Papers should be submitted as a PDF file using the START conference
manager at
https://www.softconf.com/lrec2022/BUCC/
Submissions must describe original and unpublished work and range from 4
to 8 pages plus unlimited references.
It is the authors' choice whether or not to reveal their identities in
their manuscripts submitted for review. Accepted papers will be
published in the workshop proceedings.
Double submission policy: Parallel submission to other meetings or
publications is possible, but the workshop organizers must be notified
immediately by e-mail.
For further information and updates see the BUCC 2022 website:
https://comparable.limsi.fr/bucc2022/
In case of questions, please contact Reinhard Rapp: reinhardrapp (at)
gmx (dot) de
BUCC 2022 SHARED TASK: bilingual term alignment in comparable
specialized corpora
See the shared task website at
https://comparable.limsi.fr/bucc2022/bucc2022-task.html
WORKSHOP ORGANIZERS AND CONTACT
* Reinhard Rapp (Athena R.C., Greece; Magdeburg-Stendal University of
Applied Sciences and University of Mainz, Germany)
* Pierre Zweigenbaum (Université Paris-Saclay, CNRS, LISN, Orsay, France)
* Serge Sharoff (University of Leeds, United Kingdom)
Contact workshop: reinhardrapp (at) gmx (dot) de
Contact shared task: pz (at) lisn (dot) fr
PROGRAMME COMMITTEE
* Ahmet Aker (University of Duisburg-Essen, Germany)
* Ebrahim Ansari (Institute for Advanced Studies in Basic Sciences, Iran)
* Thierry Etchegoyhen (Vicomtech, Spain)
* Hitoshi Isahara (Otemon Gakuin University, Japan)
* Kyo Kageura (University of Tokyo, Japan)
* Natalie Kübler (CLILLAC-ARP, Université de Paris, France)
* Philippe Langlais (Université de Montréal, Canada)
* Yves Lepage (Waseda University, Japan)
* Michael Mohler (Language Computer Corporation, USA)
* Emmanuel Morin (Université de Nantes, France)
* Dragos Stefan Munteanu (RWS, USA)
* Ted Pedersen (University of Minnesota, Duluth, USA)
* Reinhard Rapp (Athena R.C., Greece; Magdeburg-Stendal University of
Applied Sciences and University of Mainz, Germany)
* Nasredine Semmar (CEA LIST, Paris, France)
* Serge Sharoff (University of Leeds, UK)
* Richard Sproat (OGI School of Science & Technology, USA)
* Pierre Zweigenbaum (LISN, CNRS, Université Paris-Saclay, Orsay, France)
INFORMATION FROM THE LREC ORGANIZERS
* Describing your LRs in the LRE Map is now a normal practice in the
submission procedure of LREC (introduced in 2010 and adopted by other
conferences). To continue the efforts initiated at LREC 2014 about
“Sharing LRs” (data, tools, web-services, etc.), authors will have the
possibility, when submitting a paper, to upload LRs in a special LREC
repository. This effort of sharing LRs, linked to the LRE Map for their
description, may become a new “regular” feature for conferences in our
field, thus contributing to creating a common repository where everyone
can deposit and share data.
* As scientific work requires accurate citations of referenced work so
as to allow the community to understand the whole context and also
replicate the experiments conducted by other researchers, LREC 2022
endorses the need to uniquely identify LRs through the use of the
International Standard Language Resource Number (ISLRN, www.islrn.org),
a Persistent Unique Identifier to be assigned to each Language Resource.
The assignment of ISLRNs to LRs cited in LREC papers will be offered at
submission time.