
1

Wednesday, March 11th 2009, 3:57pm

Creating Web Crawlers using Qt

Greetings!

We have to create an application that can read and parse a particular web site to extract some of its content.

Can someone suggest a tutorial, or at least a set of Qt classes we should start looking at?

Classes to help connect to the site, make requests, get pages, parse the info, etc.

Thanks!

sendevent

Beginner


Posts: 8

Location: Russia, Saint-Petersburg


2

Thursday, March 12th 2009, 4:12pm

There are two ways to access web content: the QtNetwork module and the QtWebKit module.
QtNetwork gives you low-level control over HTTP, but you then have to analyze the received documents yourself and, if needed, download their parts (images, Flash objects, included CSS and JS files, etc.).
QtWebKit provides easier access: you call a method (load or setUrl) and it downloads all the content for you. You can set some options, such as allowing or disallowing automatic image loading and enabling or disabling Java and JavaScript (yes, this is one of the major nice things: QtWebKit can process JS). After that you can use toHtml or toPlainText to get the source HTML or the rendered text.
Both modules, however, have a problem with access to the elements of an HTML document: WebKit has an HtmlDocument and an API to manipulate it, but QtWebKit has no such API (not yet, I hope). I still can't find native Qt methods for working with HTML, so I have to use regexps.
I think QtWebKit is what you need, but that depends on factors only you know.
NB: If you use QtWebKit, you have to link against the QtGui module even if your crawler is a console application, because QtWebKit depends on it.
Nobody's perfect.

This post has been edited 1 times, last edit by "sendevent" (Mar 12th 2009, 4:18pm)