Module 2 – Coolang

Module 2. Analysis of the problem

As a starting point for the scientific tasks to be carried out by the project, it is necessary to describe the domains and scenarios, to study existing technologies and algorithms (i.e. the state of the art), and to assess the feasibility of obtaining information for subsequent analysis. An adequate knowledge of the challenges to be addressed will allow for a more rigorous, rational and systematic research process.

Milestones

List and description of selected scenarios.
State of the art for the different tasks to be tackled, oriented to each scenario.
Data sources to be used for each scenario.

Deliverables

Report describing selected scenarios.
List of articles containing the state of the art (relevant bibliography) organised according to task/scenario.
List of data sources.

Task 2.1 Identifying and characterising domains and scenarios

This task will identify the domains studied throughout the project, including: news media, social media, politics, biomedicine (with special attention to the pandemic crisis), tourism, policy and public administration, advertising and brand reputation, scientific communication, etc. These domains could be analysed and assessed in different languages and in different scenarios, including: lying, violence, harassment, leaks, depression, partisanship, racism, sexism, predatory journals, etc. The scenarios at stake can be clustered under the heading “information disorder” (Wardle and Derakhshan, 2017, 2018). This includes disinformation (hoaxes, fake news), misinformation (errors, misleading content, bias) and mal-information (leaks, harassment, hate speech). Other scenarios are also proposed in which the treatment and analysis of information will allow us to make a beneficial contribution to society, for example by detecting and preventing diseases, warning of possible crimes, generating constructive content or strengthening hope speech.

Task 2.2 Identification of techniques and algorithms

This task will identify the tools and algorithms that can be used to develop the software necessary to achieve the goals of the project. Taking into account the state of the art, we will analyse open-source libraries in specific programming languages for the development and evaluation of machine learning models. Regarding machine learning, we will analyse different algorithms for building predictive models, for example SVM, Random Forest or k-NN. These algorithms could be complemented with other approaches such as deep learning architectures, including fine-tuning over semi-supervised models or zero-shot learning solutions. These models could help to develop algorithms based on deep neural networks such as RNNs and LSTMs. To support this kind of implementation we will analyse libraries such as TensorFlow, PyTorch and Keras. Text analysis will require specific NLP libraries. We will analyse libraries that perform linguistic analysis, support different languages, and provide named entity recognition, sentiment analysis or entity linking, among other possibilities. Additionally, we will try to incorporate Transformer models. These models provide dense vector representations for words in a semantic space that can be further used in downstream NLP applications such as text classification, question answering and entity recognition. In recent years, the use of Transformers has led to performance improvements in many NLP tasks.
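As a rough illustration of one of the classical algorithms named above, the sketch below implements a k-NN text classifier over bag-of-words vectors using only the Python standard library. It is a minimal sketch under stated assumptions, not the project's implementation: the function names, the toy sentences and the labels "suspicious" and "reliable" are all hypothetical, and a real study would more likely rely on one of the established libraries under analysis.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words vector (token -> count).

    The tokenizer is deliberately naive: lowercase, then split on
    non-word characters.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train, text, k=3):
    """Label `text` by majority vote among its k most similar training texts."""
    query = vectorize(text)
    scored = sorted(
        ((cosine(query, vectorize(doc)), label) for doc, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, hand-written toy examples (not project data).
TRAIN = [
    ("miracle cure doctors hate this secret trick", "suspicious"),
    ("shocking secret they do not want you to know", "suspicious"),
    ("you will not believe this miracle trick", "suspicious"),
    ("the ministry published the annual health report", "reliable"),
    ("the city council approved the annual budget", "reliable"),
    ("researchers published the study results in a journal", "reliable"),
]

if __name__ == "__main__":
    print(knn_predict(TRAIN, "secret miracle trick"))
    print(knn_predict(TRAIN, "the council published the annual report"))
```

The same interface (fit on labelled texts, predict a label for a new text) carries over to SVM or Random Forest models; only the decision rule changes, which is what makes a side-by-side evaluation of the candidate algorithms straightforward.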

Task 2.3 Determination of sources and characterisation of content

Once the identified scenarios and domains have been analysed and defined, we will use heterogeneous sources, both structured and unstructured, for the development of this part. Structured information sources refer to existing institutional and non-institutional databases. As for the unstructured sources, all kinds of digital content will be used, e.g. web content, medical literature, social networks, etc. Our focus will be on textual content without ruling out its linkage with other types of multimodal content such as images, videos, emoticons, etc. Likewise, we will focus on Spanish, but sources in other languages may also be used, considering each language as another channel of communication. In addition, we will work with the comments and interactions of digital entities, which will allow us to analyse the subjectivity and impact on communication. Importantly, both harmful and beneficial content will be characterised.