Sometimes for my work, I find tables in HTML files on the web that I need to process.
There are of course many ways to do it: you can cut and paste them from your browser (which leaves you with tab-separated text that you can process with awk), or you can process the HTML directly with tools like BeautifulSoup.
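To give an idea of what the direct route involves, here is a minimal sketch that uses Python's standard-library html.parser as a stand-in for BeautifulSoup, so nothing needs installing. The names TableRows and table_rows are made up for this example, and it assumes simple tables (no nested tables, no rowspan or colspan).

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html: str) -> list[list[str]]:
    """Return all table rows found in `html` as lists of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = (
    "<table>"
    "<tr><th>Company</th><th>Contact</th><th>Country</th></tr>"
    "<tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>"
    "</table>"
)
for row in table_rows(sample):
    print("\t".join(row))
```

This works, but it is already a fair amount of code for a one-off table, which is part of why the conversion route is attractive.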
A solution that works quite well is to convert the HTML tables to org-mode tables, and then work on them with org-mode in Emacs (this assumes you use Emacs, of course).
Pandoc makes this easy. Just type:
pandoc -t org toto.html -o toto.org
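If you have a whole directory of saved pages, the same command can be run in a loop. The following is only a sketch: the function names pandoc_org_command and convert_all are invented for this example, and it assumes pandoc is installed and on your PATH.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_org_command(src: Path) -> list[str]:
    """Build the command shown above, writing the .org file next to the input."""
    return ["pandoc", "-t", "org", str(src), "-o", str(src.with_suffix(".org"))]


def convert_all(directory: str) -> None:
    """Run pandoc on every .html file found directly in `directory`."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    for src in sorted(Path(directory).glob("*.html")):
        # check=True raises CalledProcessError if pandoc exits with an error.
        subprocess.run(pandoc_org_command(src), check=True)
```

Calling convert_all(".") would then produce one .org file per .html file in the current directory.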
You can use it, for instance, to process the current file. In this case, the following table:
Company | Contact | Country |
---|---|---|
Alfreds Futterkiste | Maria Anders | Germany |
Centro comercial Moctezuma | Francisco Chang | Mexico |
will become
| Company                    | Contact         | Country |
|----------------------------+-----------------+---------|
| Alfreds Futterkiste        | Maria Anders    | Germany |
| Centro comercial Moctezuma | Francisco Chang | Mexico  |
It’s not always perfect (sometimes header rows are not recognized as headers), but the result is usually all you need to process the table with org-mode.
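When the header is missed, one possible fix-up is to add the rule line under the first row yourself. The sketch below assumes the symptom is simply a table with no `|---+---|` line after its first row, and that every row starts and ends with `|`; the function name add_header_rule is invented for this example.

```python
def add_header_rule(table: str) -> str:
    """Insert an org-mode rule line after the first row if none is present."""
    lines = table.strip().splitlines()
    if len(lines) < 2 or lines[1].lstrip().startswith("|-"):
        # Single row, or the rule is already there: nothing to do.
        return "\n".join(lines)
    header = lines[0].strip()
    # One run of dashes per header cell, joined with '+', so the rule
    # has the same width as the header row.
    cells = header[1:-1].split("|")
    rule = "|" + "+".join("-" * len(cell) for cell in cells) + "|"
    return "\n".join([header, rule] + lines[1:])


broken = "| Company | Contact |\n| Alfreds Futterkiste | Maria Anders |"
print(add_header_rule(broken))
```

The widths only match the first row, but pressing TAB inside the table in Emacs realigns everything anyway.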