Tag Archives: data scraping

Book “Scraping for Journalists”: how to collect and analyze data

If someone had told me a few years ago that being a modern journalist would mean knowing how to work with Excel and databases, I would probably have chosen another field.

Knowing how to work with data is one of the most important skills for digital journalists. Just think of the amount of statistical information at our disposal, in quantities so overwhelming that it is hard to see where the story is and what other stories might be found.

To help us find these needles in the haystack of digital information, Paul Bradshaw wrote a very practical book on how to query sources (usually databases) and analyze the results effectively. Although it is geared towards an Anglo-Saxon reality, where the availability and organization of data are far ahead of Portugal's, it is an excellent way into this area.

It has some programming in it, so it won't be for everyone, but in the digital world we live in you need to know more than how to write and edit. You can see an excerpt in PDF here.

“Scraping for Journalists introduces you to a range of scraping techniques – from very simple scraping techniques which are no more complicated than a spreadsheet formula, to more complex challenges such as scraping databases or hundreds of documents. At every stage you’ll see results – but you’ll also be building towards more ambitious and powerful tools.

You’ll be scraping within 5 minutes of reading the first chapter – but more importantly you’ll be learning key principles and techniques for dealing with scraping problems.”

Scraping for Journalists

Paul Bradshaw was my teacher on the MA in Online Journalism in Birmingham.



Portuguese data scraping experiment: parliament activities and YQL

It looks simple and effective, and something like this should have been made before by a newspaper or other news outlet. But, once again, it’s the geek community that steps forward and builds a useful tool using public data (although I heard recently that a political editor working in an important media institution said that the deputies’ attendance was a “state secret”. Yeah, right…). And why? Just for the fun of it!

List of deputies by electoral circles

The mind behind this experiment is Luis Confraria, who works as a front end developer at Outbox Ativism, a company that specializes in digital and web products for companies and institutions. So he’s not a journalist, but he decided to experiment with some tools to create what can be considered a journalistic product. I asked him why: “Well, my main motivation was... fun! Also, I liked the idea of building upon some government website that indeed contained all the data but not in the most useful way. Besides, I really wanted to learn and test some stuff I used in it.”

The project is based on the Portuguese Parliament website, and shows all the current deputies by party or electoral circle, and the profile and participation of each one of them. The original information is not organized in an easy-to-use way, so Luis resorted to some data scraping techniques. “At first I started scraping the data with a little Python script but then I went with YQL. Made a few open tables, and pushed them to GitHub. On the client side I just used simple HTML / CSS / JavaScript with jQuery and sammy.js.”
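To give an idea of what a “little Python script” for this kind of job can look like, here is a minimal sketch using only the standard library. It is not Luis’s code: the HTML sample, the column layout and the names `TableRowParser`, `scrape_deputies` and `group_by` are all made up for illustration, and the real Parliament pages are structured differently.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical markup, invented for this example. The real pages differ.
SAMPLE_HTML = """
<table id="deputies">
  <tr><th>Name</th><th>Party</th><th>Electoral circle</th></tr>
  <tr><td>Deputy A</td><td>Party X</td><td>Lisboa</td></tr>
  <tr><td>Deputy B</td><td>Party Y</td><td>Porto</td></tr>
  <tr><td>Deputy C</td><td>Party X</td><td>Porto</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows only hold <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None


def scrape_deputies(html):
    """Turn the table rows into a list of dicts, one per deputy."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return [dict(zip(("name", "party", "circle"), row)) for row in parser.rows]


def group_by(deputies, key):
    """Group deputy names by the given field ("party" or "circle")."""
    groups = defaultdict(list)
    for deputy in deputies:
        groups[deputy[key]].append(deputy["name"])
    return dict(groups)


deputies = scrape_deputies(SAMPLE_HTML)
by_party = group_by(deputies, "party")
by_circle = group_by(deputies, "circle")
```

Once the rows are plain dictionaries, listing deputies by party or by electoral circle is just a matter of grouping on a different key, which is the same reshaping the project does with the Parliament’s data.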

There aren’t many open-data or data scraping projects in Portugal, and most of the ones I know of were created by non-journalists; Luis pointed to transparencia-pt.org as an example. Is there a lack of programmers in Portuguese journalism? I think so, but Luis goes even further: “There is probably a lack of programmers in society at large 🙂 not just in journalism.” Maybe journalists are just not having fun with their work.

“I’m optimistic about this. We will see more and more of these projects because there are better and faster tools each day and because people do really care.” To which he adds: “The most important is to have public data in the simplest format possible. The rest will come naturally.”

I’m not as optimistic as he is, but I agree: there is a need, and things will find their own course. But it was hard to find space for videographers in the newsrooms, so I’m not confident about the future of programmers in the new newsrooms. At least in most of them. Now that he did this, what comes next?

“Probably I will tweak and fix it some more. (It still looks like crap on IE). I hope someone builds something else with the YQL tables. I have some ideas popping now and then but nothing mature enough.”

We’ll be looking out for them.

Luis shared the YQL tables, so have fun:

and here is the code for the tables:
Detailed information for each deputy: attendance, participation and profile