Posts by Collection

portfolio

publications

Portal Web de Analíticas de Uso para Cuentas Compartidas en AWS

Published in Trabajo Fin de Grado - Grado en Ingeniería Informática, 2018

The adoption of public Cloud providers such as Amazon Web Services (AWS), in both academic and business settings, generally requires user accounts to be shared among different members of the institution. It is important to know how much use the users make of the platform, especially when a Cloud platform is integrated into an educational setting, where it is necessary to know the extent to which students have used the different AWS services. This Bachelor's thesis (Trabajo de Fin de Grado) focused on the development of a distributed application, called CloudTrail-Tracker, for collecting, processing and analyzing events in order to obtain the activity of an AWS user account. It reveals the activities that have caused changes in the infrastructure, shown by user or by service, or with more customized filters. CloudTrail-Tracker was designed as a serverless application: it runs entirely on AWS services and event-driven computing, so no explicit server management is needed. The application has been released to the community under an open-source license and is currently used in production to support the analysis of student activity in Cloud Computing courses of three master's degrees at the Universitat Politècnica de València. A performance and cost analysis of the tool was carried out, concluding that it can be an effective solution for analyzing the use of AWS accounts, both in terms of economic cost and in its applicability to multiple scenarios, not only educational ones.
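As a rough illustration of the per-user and per-service aggregation described above, the following sketch counts infrastructure-changing events in a handful of records. The field names (`eventSource`, `eventName`, `readOnly`, `userIdentity.userName`) follow the AWS CloudTrail record format; the sample records and the `summarize_events` helper are hypothetical and are not taken from CloudTrail-Tracker itself.

```python
from collections import Counter


def summarize_events(records):
    """Count write (non read-only) events per user and per service.

    Hypothetical helper: records are dicts trimmed to a few CloudTrail fields.
    """
    by_user = Counter()
    by_service = Counter()
    for record in records:
        # Read-only calls do not change the infrastructure, so skip them.
        # Assumption: a record without the readOnly field counts as a write.
        if record.get("readOnly", False):
            continue
        user = record.get("userIdentity", {}).get("userName", "unknown")
        by_user[user] += 1
        by_service[record["eventSource"]] += 1
    return by_user, by_service


# Hypothetical sample records, reduced to the fields used here.
sample_records = [
    {"eventSource": "ec2.amazonaws.com", "eventName": "RunInstances",
     "readOnly": False, "userIdentity": {"userName": "alice"}},
    {"eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances",
     "readOnly": True, "userIdentity": {"userName": "alice"}},
    {"eventSource": "s3.amazonaws.com", "eventName": "CreateBucket",
     "readOnly": False, "userIdentity": {"userName": "bob"}},
]

by_user, by_service = summarize_events(sample_records)
```

Filtering on `readOnly` is one simple way to keep only the events that modify resources; a real deployment would likely need additional filters.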

Recommended citation: Prieto Fontcuberta, JR. (2018). Portal Web de Analíticas de Uso para Cuentas Compartidas en AWS. http://hdl.handle.net/10251/106685 https://riunet.upv.es/handle/10251/106685

Bots and Gender Profiling using a Deep Learning Approach

Published in CLEF (Working Notes) 2019, 2019

This paper describes the system we developed for the Bots and Gender Profiling task at PAN@CLEF 2019. Given a set of tweets, the task is to automatically determine whether its author is a bot or a human and, in the case of a human, to identify the author's gender. We propose a deep-learning-based system fed with the TF-IDF representation of the texts instead of the usual word-embedding representation. Additionally, we use some linguistic features which, according to the experimental results, improve the performance of the system.
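To make the TF-IDF representation concrete, here is a minimal sketch that turns a few texts into fixed-length vectors that a feed-forward network could take as input. The abstract does not say which TF-IDF variant or tokenizer was used, so the weighting (relative term frequency times log(N/df)), the whitespace tokenization, and the `tfidf_vectors` helper are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty texts
        vectors.append([
            (counts[term] / length) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical texts; each one stands for the concatenated tweets of an author.
texts = [
    "follow me for free coins",
    "free coins free coins",
    "lovely walk in the park today",
]
vocabulary, vectors = tfidf_vectors(texts)
```

Each vector has one entry per vocabulary term, so every author is mapped to an input of the same size regardless of how much text they wrote.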

Recommended citation: JR Prieto Fontcuberta, GL De la Peña Sarracén - CLEF (Working Notes), 2019 http://ceur-ws.org/Vol-2380/paper_221.pdf

Análisis de Maquetación (‘Layout’) en imágenes de texto manuscrito mediante Redes Neuronales

Published in Trabajo Fin de Máster - Máster Universitario en Inteligencia Artificial, Reconocimiento de Formas e Imagen Digital, 2019

There are large collections of manuscripts that contain very valuable information about crucial aspects of the history of our society. There are so many documents that extracting all the information, most of which is textual, by hand would take years or even centuries. For this reason, automatic layout analysis and handwritten text recognition techniques are applied to the images in order to understand the information these collections provide better and more efficiently. This Master's thesis (Trabajo de Fin de Máster) focused on the development and evaluation of different deep learning techniques for the layout analysis of pages of high historical value. The work therefore revolves around two tasks. The first is the segmentation of regions in a corpus spanning the 14th to the 19th century. This corpus consists mostly of tables, which enables subsequent analysis to support structured queries. The second task is the separation of records in a collection from the 14th to the 15th century dictated by the King of France. This separation would help in searching for specific topics of the period, as well as for possible rulings written in those records. In addition, the textual information available in both collections was fused with the graphical information of the page in order to analyze its impact on the results. After experimenting with different convolutional network architectures, the baseline results were improved in one of the tasks. Moreover, the textual information extracted from the content of the documents helped improve the results in both tasks.

Recommended citation: Prieto Fontcuberta, JR. (2019). Análisis de Maquetación ("Layout") en imágenes de texto manuscrito mediante Redes Neuronales. https://riunet.upv.es/handle/10251/130017

Herramienta web para el seguimiento automatizado de actividades educativas prácticas en la nube

Published in Actas de las Jornadas sobre Enseñanza Universitaria de la Informática 2019, 2019

This contribution presents a teaching resource for the automated collection and analysis of evidence generated in hands-on educational activities in the cloud, exemplified for Amazon Web Services (AWS). It includes an architecture that enables data capture and processing using cloud services, as well as an educational web dashboard where instructors and students can consult information on the use of the different AWS services. It also supports student self-regulation by providing information on the percentage of progress in each lab session and the actions still required to complete each lab assignment. The tool automatically extracts learning analytics from these data, which provide evidence of how far a given student has progressed in a lab assignment. It also obtains aggregated information on the AWS resource usage of different students throughout an academic year. The tool, which has been released to the community as open source, is being used in production in three master's degrees and an online AWS training course, and can be applied in educational settings that involve this Cloud provider.

Recommended citation: Germán Moltó, Diana M Naranjo y José Ramón Prieto. «Herramienta web para el seguimiento automatizado de actividades educativas prácticas en la nube». En: Actas de las JENUI. 2019, págs. 175-182. https://aenui.org/actas/pdf/JENUI_2019_031.pdf

A visual dashboard to track learning analytics for educational cloud computing

Published in Sensors 2019, 19(13), 2952, 2019

Cloud providers such as Amazon Web Services (AWS) stand out as useful platforms to teach distributed computing concepts as well as the development of Cloud-native scalable application architectures on real-world infrastructures. Instructors can benefit from high-level tools to track the progress of students during their learning paths on the Cloud, and this information can be disclosed via educational dashboards so that students understand their progress through the practical activities. To this end, this paper introduces CloudTrail-Tracker, an open-source platform to obtain enhanced usage analytics from a shared AWS account. The tool provides the instructor with a visual dashboard that depicts the aggregated usage of resources by all the students during a certain time frame and the specific use of AWS by a given student. To facilitate student self-regulation, the dashboard also depicts the percentage of progress for each lab session and the actions still pending for the student. The dashboard has been integrated in four Cloud subjects that use different learning methodologies (from face-to-face to online learning), and the students positively highlight the usefulness of the tool for Cloud instruction in AWS. This automated collection of evidence of student activity on the Cloud yields close to real-time learning analytics that are useful both for semi-automated assessment and for students' awareness of their own training progress.
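The percentage of progress and the list of pending actions mentioned above can be sketched as a comparison between the actions a lab session requires and the actions a student has performed. The `lab_progress` helper and the sample lab definition are hypothetical; the action names are ordinary AWS API actions chosen only for illustration, and the actual dashboard may compute progress differently.

```python
def lab_progress(required_actions, performed_actions):
    """Return (percentage completed, pending actions in their original order).

    Assumes the required actions are distinct.
    """
    done = set(performed_actions)
    pending = [action for action in required_actions if action not in done]
    if not required_actions:
        # A lab with no required actions is trivially complete.
        return 100.0, pending
    completed = len(required_actions) - len(pending)
    return 100.0 * completed / len(required_actions), pending


# Hypothetical lab session: four required actions, two already performed.
required = ["RunInstances", "CreateBucket", "PutObject", "TerminateInstances"]
performed = ["RunInstances", "CreateBucket", "DescribeInstances"]

percentage, pending = lab_progress(required, performed)
```

Actions that are performed but not required, such as `DescribeInstances` here, simply do not affect the result.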

Recommended citation: Naranjo, D.M.; Prieto, J.R.; Moltó, G.; Calatrava, A. A Visual Dashboard to Track Learning Analytics for Educational Cloud Computing. Sensors 2019, 19, 2952. https://doi.org/10.3390/s19132952

Textual-Content-Based Classification of Bundles of Untranscribed Manuscript Images

Published in International Conference on Pattern Recognition (ICPR 2020), 2020

Content-based classification of manuscripts is an important task that is generally performed in archives and libraries by experts with a wealth of knowledge of the manuscripts' contents. Unfortunately, many manuscript collections are so vast that it is not feasible to rely solely on experts to perform this task. Current approaches for textual-content-based manuscript classification generally require the handwritten images to be first transcribed into text, but achieving sufficiently accurate transcripts is generally unfeasible for large sets of historical manuscripts. We propose a new approach to automatically perform this classification task which does not rely on any explicit image transcripts. It is based on “probabilistic indexing”, a relatively novel technology which makes it possible to effectively represent the intrinsic word-level uncertainty generally exhibited by handwritten text images. We assess the performance of this approach on a large collection of complex manuscripts from the Spanish Archivo General de Indias, with promising results. To the best of our knowledge, this is the first published work proposing, developing and assessing a successful approach for content-based classification of untranscribed manuscript images.

Recommended citation: Prieto, J. R., Bosch, V., Vidal, E., Alonso, C., Orcero, M. C., & Marquez, L. (2020). Textual-Content-Based Classification of Bundles of Untranscribed Manuscript Images. International Conference on Pattern Recognition (ICPR), 3162–3169.

Writer Identification Using Deep Neural Networks: Impact of Patch Size and Number of Patches

Published in International Conference on Pattern Recognition (ICPR 2020), 2020

Traditional approaches for the recognition or identification of the writer of a handwritten text image used to rely on heuristic knowledge about the shape and other features of the strokes of previously segmented characters. However, recent works have made significant advances in the state of the art thanks to the use of various types of deep neural networks. In most of these works, text images are decomposed into patches, which are processed by the networks without any previous character or word segmentation. In this paper, we study how the way images are decomposed into patches impacts recognition accuracy, using three publicly available datasets. The study also includes a simpler architecture where no patches are used at all: a single deep neural network takes a whole text image as input and directly provides a writer recognition hypothesis. Results show that bigger patches generally lead to improved accuracy, achieving in one of the datasets a significant improvement over the best results reported so far.
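The trade-off between patch size and number of patches can be seen in a small sketch of patch decomposition. The abstract does not say how patches are sampled, so the regular grid, the `stride` parameter, and the `extract_patches` helper below are assumptions for illustration only.

```python
def extract_patches(image, patch_size, stride):
    """Split a 2D image (a list of rows) into square patches on a regular grid.

    Patches that would extend past the image border are skipped.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    patches = []
    for top in range(0, height - patch_size + 1, stride):
        for left in range(0, width - patch_size + 1, stride):
            patches.append(
                [row[left:left + patch_size]
                 for row in image[top:top + patch_size]]
            )
    return patches


# Toy 4x6 "image" whose pixel values encode their own position.
image = [[row * 6 + col for col in range(6)] for row in range(4)]

small = extract_patches(image, patch_size=2, stride=2)
large = extract_patches(image, patch_size=4, stride=2)
```

With the same stride, larger patches cover more context each but fewer of them fit in the image, which is the kind of balance the study examines.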

Recommended citation: Punjabi, A., Ram, J., & Vidal, E. (2021). Writer Identification Using Deep Neural Networks: Impact of Patch Size and Number of Patches. 9764–9771.

The Carabela project and manuscript collection: large-scale probabilistic indexing and content-based classification

Published in 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2020

The main aim of the Carabela project was to develop and apply techniques that allow textual searching on massive Spanish collections of 15th-19th century manuscripts. The project focused on a relatively small subset of 125 000 images of collections of interest to underwater archaeology. For this type of manuscript, state-of-the-art automatic transcription techniques generally fail to achieve usable transcription accuracy. Therefore, rather than insisting on actual transcription, methodologies for probabilistic indexing of handwritten text images have been adopted. This has allowed us to effectively cope with the intrinsically high degree of uncertainty of the text contained in most historical manuscripts, leading to highly effective systems for textual search and retrieval. Carabela has gone one step further by developing new techniques to classify probabilistically indexed, but otherwise untranscribed, text images according to their textual content. These techniques have been successfully used to automatically classify Carabela bundles (each containing hundreds or thousands of pages) according to their “level of risk” of public exposure, in order to control their access and avoid as much as possible the plundering of Spanish underwater heritage.

Recommended citation: E. Vidal et al., "The Carabela Project and Manuscript Collection: Large-Scale Probabilistic Indexing and Content-based Classification," 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2020, pp. 85-90, doi: 10.1109/ICFHR2020.2020.00026. https://ieeexplore.ieee.org/abstract/document/9257622

Text Content Based Layout Analysis

Published in Proceedings of International Conference on Frontiers in Handwriting Recognition, ICFHR 2020, 2020

State-of-the-art Document Layout Analysis methods rely on graphical appearance features in order to detect and classify the different layout regions present in a scanned text image. In many cases, however, performing this task using only graphical information is problematic or impossible. Only by actually reading some text at the boundaries of the problematic regions does it become possible to reliably detect and separate these regions. In these situations, textual, content-based features would be required, but since transcription is usually performed after layout analysis, a vicious circle arises. In this work, we circumvent this deadlock by making use of the recently introduced concept of Probabilistic Index Map. We use the word relevance probabilities provided by this map to calculate relevant text-content-based features at the pixel level. We assess the impact of these new features on a complex paragraph classification task on historical documents. The experiments are performed using both a classical Hidden Markov Model approach and Deep Neural Networks. The obtained results are encouraging and showcase the positive impact text-content-based features will have on the Document Layout Analysis research field.

Recommended citation: Prieto, J. R., Bosch, V., Vidal, E., Stutzmann, D., & Hamel, S. (2020). Text Content Based Layout Analysis. Proceedings of International Conference on Frontiers in Handwriting Recognition, ICFHR, 2020-September, 258–263. https://doi.org/10.1109/ICFHR2020.2020.00055

Improved graph methods for table layout understanding

Published in ICDAR 2021: Document Analysis and Recognition – ICDAR 2021, 2021

Recently, there have been significant advances in document layout analysis and, particularly, in the recognition and understanding of tables and other structured documents in handwritten historical texts. In this work, a series of improvements over current techniques based on graph neural networks are proposed, which considerably improve state-of-the-art results. In addition, a two-pass approach is also proposed where two graph neural networks are sequentially used to provide further substantial improvements of more than 12 F-measure points in some tasks. The code developed for this work will be published to facilitate the reproduction of the results and possible improvements.

Recommended citation: Prieto, J.R., Vidal, E. (2021). Improved Graph Methods for Table Layout Understanding. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12822. Springer, Cham. https://doi.org/10.1007/978-3-030-86331-9_33

Fusion of Visual and Textual Features for Table Header Detection in Handwritten Text Images

Published in 2021 International Conference on Computational Science and Computational Intelligence (CSCI), 2021

This paper introduces a new procedure to improve table header detection in handwritten text images through the fusion of the posterior probabilities provided by two baseline classifiers. Each classifier considers a different modality, namely visual or textual features. Both baseline classifiers implement convolutional neural networks, specifically adopting the U-Net architecture. Four fusion methods are considered: the mean; linear discriminant analysis and random forest as meta-classifiers; and a recently developed method called alpha integration. The testing dataset consisted of 89 page images drawn from the Passau dataset. The improved performance provided by the fusion methods in these experiments is noteworthy given the complexity of the problem addressed. In terms of area under the receiver operating characteristic curve, the best results were obtained by alpha integration. This method incorporates least mean square parameter optimization. The improvement is relevant in the context of the targeted problem.
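Of the four fusion methods listed above, the mean is the simplest to illustrate: the posterior probabilities produced by the visual and textual classifiers for the same samples are averaged. The sketch below uses made-up probabilities; the `mean_fusion` helper and the 0.5 decision threshold are assumptions, not details taken from the paper.

```python
def mean_fusion(visual_probs, textual_probs):
    """Average two lists of posterior probabilities element by element."""
    if len(visual_probs) != len(textual_probs):
        raise ValueError("both classifiers must score the same samples")
    return [(v + t) / 2.0 for v, t in zip(visual_probs, textual_probs)]


# Hypothetical "table header" posteriors for three samples (e.g. pixels).
visual = [0.9, 0.2, 0.6]
textual = [0.7, 0.4, 0.1]

fused = mean_fusion(visual, textual)
# Assumed threshold of 0.5 to turn fused posteriors into header decisions.
labels = [p >= 0.5 for p in fused]
```

In the third sample the two modalities disagree and the average falls below the threshold, which shows how fusion can overrule a single classifier.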

Recommended citation: A. Salazar, J. R. Prieto, E. Vidal, G. Safont and L. Vergara, "Fusion of Visual and Textual Features for Table Header Detection in Handwritten Text Images," 2021 International Conference on Computational Science and Computational Intelligence (CSCI), 2021, pp. 1560-1566, doi: 10.1109/CSCI54926.2021.00304.

Classification of untranscribed handwritten notarial documents by textual contents

Published in IbPRIA 2022: Pattern Recognition and Image Analysis, 2022

Huge amounts of digital page images of important manuscripts are preserved in archives worldwide. The amounts are so large that it is generally unfeasible for archivists to adequately tag most of the documents with the metadata required for proper organization of the archives and effective exploration by scholars and the general public. The class or “typology” of a document is perhaps the most important tag to be included in the metadata. The technical problem is one of automatic classification of documents, each consisting of a set of untranscribed handwritten text images, by the textual contents of the images. The approach considered is based on “probabilistic indexing”, a relatively novel technology which makes it possible to effectively represent the intrinsic word-level uncertainty exhibited by handwritten text images. We assess the performance of this approach on a large collection of complex notarial manuscripts from the Spanish Archivo Histórico Provincial de Cádiz, with promising results.

Recommended citation: Flores, J.J., Prieto, J.R., Garrido, D., Alonso, C., Vidal, E. (2022). Classification of Untranscribed Handwritten Notarial Documents by Textual Contents. In: Pinho, A.J., Georgieva, P., Teixeira, L.F., Sánchez, J.A. (eds) Pattern Recognition and Image Analysis. IbPRIA 2022. Lecture Notes in Computer Science, vol 13256. Springer, Cham. https://doi.org/10.1007/978-3-031-04881-4_2

Extracting descriptive words from untranscribed handwritten images

Published in IbPRIA 2022: Pattern Recognition and Image Analysis, 2022

Extracting descriptive text from manuscripts to be included in the manuscript metadata is an important task that is generally performed in archives and libraries by experts with a wealth of knowledge of the manuscripts' contents. Unfortunately, many manuscript collections are so vast that it is not feasible to rely solely on experts to perform this task. To our knowledge, this is the first work aiming at automatic extraction of descriptive text from untranscribed text images. A first step towards such a task would be to transcribe the handwritten images into text, but achieving sufficiently accurate transcripts is generally unfeasible for large sets of historical manuscripts. We propose new approaches to automatically extract descriptive words which do not rely on any explicit image transcripts. They are based on “probabilistic indexing”, a relatively novel technology which makes it possible to effectively represent the intrinsic word-level uncertainty generally exhibited by handwritten text images. We assess the performance of these approaches on samples of a large collection of complex manuscripts from the Spanish Archivo General de Indias. Since no standard metrics exist for the novel task considered in this work, we propose two new evaluation measures which aim at measuring the quality of the detected descriptive words in terms close to the practical usage of these words. Using these metrics, we report promising preliminary results.

Recommended citation: Prieto, J.R., Vidal, E., Sánchez, J.A., Alonso, C., Garrido, D. (2022). Extracting Descriptive Words from Untranscribed Handwritten Images. In: Pinho, A.J., Georgieva, P., Teixeira, L.F., Sánchez, J.A. (eds) Pattern Recognition and Image Analysis. IbPRIA 2022. Lecture Notes in Computer Science, vol 13256. Springer, Cham. https://doi.org/10.1007/978-3-031-04881-4_43

Information Extraction from Handwritten Tables in Historical Documents

Published in DAS 2022: Document Analysis Systems, 2022

Recently, significant advances have been made in Document Understanding in structured historical documents. However, not much research has been done in information extraction from handwritten structured historical documents. In this paper, we compare two Machine Learning approaches and another approach that is based on heuristic rules to extract information in historical pre-printed forms with handwritten information. We analyze how each approach performs at each step of the extraction process. The proposed approaches improve the heuristic-rule baseline by up to 0.14 F-measure points throughout the information extraction pipeline.

Recommended citation: Andrés, J., Prieto, J.R., Granell, E., Romero, V., Sánchez, J.A., Vidal, E. (2022). Information Extraction from Handwritten Tables in Historical Documents. In: Uchida, S., Barney, E., Eglin, V. (eds) Document Analysis Systems. DAS 2022. Lecture Notes in Computer Science, vol 13237. Springer, Cham. https://doi.org/10.1007/978-3-031-06555-2_13

talks

teaching

Data Structures and Algorithms

Undergraduate course, Universitat Politècnica de València (UPV), Escuela Técnica Superior de Informática (ETSINF), Departamento de Sistemas Informáticos y Computación (DSIC), 2021

Machine Learning

Undergraduate course, Universitat Politècnica de València (UPV), Escuela Técnica Superior de Informática (ETSINF), Departamento de Sistemas Informáticos y Computación (DSIC), 2021

Programming

Undergraduate course, Universitat Politècnica de València (UPV), Escuela Técnica Superior de Informática (ETSINF), Departamento de Sistemas Informáticos y Computación (DSIC), 2022