Algorithms for Table Structure Recognition

Yosveni Escalona

doi:10.17163/ings.n25.2021.05


Published: 2020-12-31


Keywords:

Tabular Data, HTML Tables, Spreadsheets, Conditional Random Fields, Machine Learning, Algorithm

Yosveni Escalona

http://orcid.org/0000-0003-2992-0540

Abstract

Tables are widely used to organize and publish data. The Web, for example, hosts an enormous number of tables, published in HTML, embedded in PDF documents, or simply downloadable from Web pages. However, tables are not always easy to interpret because of the variety of features and formats used. Indeed, a large number of methods and tools have been developed to interpret tables. This work presents the implementation of an algorithm, based on Conditional Random Fields (CRFs), that classifies the rows of a table as header rows, data rows, or metadata rows. The implementation is complemented by two algorithms for table recognition in spreadsheet documents, based respectively on rules and on region detection. Finally, the work describes the results and benefits obtained by applying the implemented algorithm to HTML tables obtained from the Web and to spreadsheet tables downloaded from the Brazilian National Petroleum Agency.

Issue

No. 25 (2021): January-June

Section

Scientific Paper

The Universidad Politécnica Salesiana of Ecuador retains the copyright of the published works and encourages their reuse. The works are published in the electronic edition of the journal under a Creative Commons Attribution/Noncommercial-No Derivative Works 4.0 Ecuador license: they can be copied, used, disseminated, transmitted and publicly displayed.

The undersigned author partially transfers the copyrights of this work to the Universidad Politécnica Salesiana of Ecuador for printed editions.

The author(s) also state that they have respected the ethical principles of research and are free from any conflict of interest. The author(s) certify that this work has not been published, nor is it under consideration for publication in any other journal or editorial work.

The author(s) are responsible for the content of the work and have contributed to its conception, design and completion, to the analysis and interpretation of data, and to the writing of the text and its revisions, as well as to the approval of the final version, which is submitted as an attachment.

