HarvestMan Webcrawler
=====================

Introduction
============

HarvestMan is an offline internet crawler (robot) program written
in Python. It helps you to grab pages from the internet and store
them in a local directory for offline browsing.

This is version 1.4 final.

Author: Anand B Pillai.

Copyright:
        All software in this distribution is 
        (C) Anand B Pillai.

License: See the file LICENSE.TXT.

Getting started 
===============
Unarchive the file to a directory of your choice. 

% tar -xjf HarvestMan-1.4-tar.bz2

Change to directory 'HarvestMan-1.4' and install
the program.

Installation steps are given below.

How to Install
==============
Make sure you are at the top-level HarvestMan
directory. For this version it is 'HarvestMan-1.4'.

On Windows, you need to run the setup.py script
to install the program. 

Open a Windows cmd shell and type the following.

% python setup.py build
% python setup.py install

On Linux/Unix, you can use the install script
named 'install'. This is a bash script. To run
it, first become the super-user,

$ su

Then install using,

$ ./install

On Linux/Unix/Mac, it is a good idea to enable the
system-wide symbolic link. This creates a link named
'harvestman' in the /usr/bin directory that points to
the HarvestMan main module, so you can run the program
by typing 'harvestman' at a shell prompt.

Running the program
===================
        
The program requires a configuration file to run. This
is named 'config.txt' by default. To pass a different
configuration file, use the command-line argument '-C'
or '--configfile'.

Create the config file by editing one by hand or by running the
config file generation script 'genconfig.py', provided in the 'tools'
sub-directory. A sample config file is also available in the
'HarvestMan' directory.

If you prefer to work online, you can generate a configuration
file with the config generator at the url,
http://harvestman.freezope.org/configgenerator .

Windows
=======
You need to either add the location of the HarvestMan main module
to the PATH environment variable or run the program from its
installation directory. This is normally the sub-directory named
'HarvestMan' in the site-packages directory of your Python installation
folder.

Then in the command prompt,

% harvestman.py 

or

% python harvestman.py

Linux/Unix/Mac
==============

If you have enabled the system-wide symbolic link,
you can run the program just by typing 'harvestman'
at the console.

$ harvestman

Otherwise, you have to create a symbolic link to
the file 'harvestman.py' in a directory of your choice
and run the program from that directory as,

$ python harvestman.py

This will start the program using the settings in the
config file. 

Project Browse Page
===================

Upon completion, the program creates an html file for browsing
projects and opens it in the user's default web browser. This 
file is named 'index.html' and is created in the base directory 
of the program. You can click on the project link to browse directly
to the saved files. 

New project information is automatically appended
to this file.

Command line mode
=================

The program can also read configuration from the command line.
However, many of the command line options are not up to date,
so this mode of working is not recommended.

For information on the command line options, run the program 
with the --help or -h option.

Project file mode
=================

HarvestMan writes a project file before it starts crawling websites.
This file has the extension '.hbp' and is written in the base 
directory of the project.

You can read this file back to restart the project later on. 

For this, use the '--projectfile' or '--PF' option and pass the project file
path as its argument. This reruns a previously run project.
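
The options mentioned so far can be combined into a command line from
a script. The following is a minimal Python sketch, not part of
HarvestMan. The function name 'build_command' and the file name
'pydoc.hbp' are hypothetical, and it assumes that each option takes
its file path as the next argument.

```python
def build_command(configfile=None, projectfile=None, program="harvestman"):
    """Return the argument list for one HarvestMan run.

    Only the options named in this README are used. How the real
    program reacts when both are given is not covered here.
    """
    command = [program]
    if configfile:
        # '-C' is the short form of '--configfile'.
        command += ["--configfile", configfile]
    if projectfile:
        # '--PF' is the short form of '--projectfile'.
        command += ["--projectfile", projectfile]
    return command


# Example: rerun a saved project (hypothetical path).
print(build_command(projectfile="d:/websites/pydoc/pydoc.hbp"))
```

The resulting list can be passed to a process-launching call such as
subprocess.call.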

The Config file
===============

The config file provides the program with its settings. It
contains name-value pairs separated by spaces/tabs. For example,

project.url     	http://www.python.org/doc/
project.basedir     	d:/websites
project.name		pydoc    

You can add comments by starting a line with the hash ('#')
or double semicolon (';;') characters. A comment can be on a line
by itself or at the end of a configuration line.

# Url to download
project.url		http://www.python.org/doc/
project.basedir     	d:/websites     ;; This is the base directory
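
The format described above is simple enough to read with a few lines
of Python. The following is a minimal sketch, not the parser that
HarvestMan itself uses. The function name 'parse_config' is
hypothetical. It assumes that a trailing '#' comment is preceded by
whitespace (so that a '#' inside a url is kept) and that values never
contain ';;'.

```python
import re


def parse_config(text):
    """Parse HarvestMan-style config text into a dict of name -> value."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and whole-line comments ('#' or ';;').
        if not line or line.startswith("#") or line.startswith(";;"):
            continue
        # Drop a trailing comment (assumption: whitespace before '#').
        line = re.split(r"\s+#|;;", line, maxsplit=1)[0].strip()
        # The name and the value are separated by spaces or tabs.
        parts = line.split(None, 1)
        if len(parts) == 2:
            options[parts[0]] = parts[1]
    return options


sample = (
    "# Url to download\n"
    "project.url\t\thttp://www.python.org/doc/\n"
    "project.basedir     \td:/websites     ;; This is the base directory\n"
)
print(parse_config(sample))
```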

The new version of the config file separates config variables into
9 different sections as described below.

Section                       Description

1. project                    All project related variables
2. network                    All network related variables like proxy,
                              proxy username/password etc.
3. url                        Any username/password for the url
4. download                   All download related variables (html/image/
                              stylesheets/cookies etc)
5. control                    All download control variables (filters/
                              maximum limits/timeouts/depths/robots.txt)
6. system                     Any system related variables (fastmode/thread
                              status/thread timeouts/thread pool size etc)
7. indexer                    All indexer related variables (localize etc)
8. files                      All harvestman file settings (config/message log/
                              error log/url list file etc)
9. display                    Display (GUI/browser) related settings
  
HarvestMan accepts more than 50 different configuration options.

For a detailed discussion of the options, refer to the HarvestMan
documentation files in the 'doc' sub-directory or point your browser
to http://harvestman.freezope.org/configoptions.html .
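
Since each option name starts with its section (as in 'project.url'),
options can be grouped by the nine sections listed above. The
following is a minimal Python sketch under that assumption. The
function name 'group_by_section' is hypothetical, and rejecting unknown
sections is a choice made for this sketch, not documented HarvestMan
behaviour.

```python
# The nine section names from the table in this README.
SECTIONS = ("project", "network", "url", "download", "control",
            "system", "indexer", "files", "display")


def group_by_section(options):
    """Turn {'section.name': value} into {section: {name: value}}."""
    grouped = {}
    for key, value in options.items():
        # Split at the first dot only: 'project.url' -> 'project', 'url'.
        section, dot, name = key.partition(".")
        if not dot or section not in SECTIONS:
            raise ValueError("unknown config section in option: %r" % key)
        grouped.setdefault(section, {})[name] = value
    return grouped


print(group_by_section({
    "project.url": "http://www.python.org/doc/",
    "project.name": "pydoc",
}))
```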

Sample config file 
==================

Part of a sample HarvestMan config file (version 2.0) is shown
below.

;;HarvestMan Configuration File version 2.0

;;project related variables
project.url                        www.python.org/doc/current/tut/tut.html
project.name                       pytut
project.basedir                    d:/websites
project.verbosity                  2

The actual config file is much larger than this and contains various
config sections.

A script genconfig.py is provided to generate a config file
based on inputs from the user.

Platforms
=========

This HarvestMan version was developed on Python 2.3, specifically
Python 2.3.2 and higher.

It is best to run HarvestMan with the latest stable release of Python,
currently Python 2.3, to get the benefit of all features.
For example, the HTML tidying feature works only with Python 2.3 or higher.

The minimal requirement is any version of Python 2.2. 

HarvestMan should work on all platforms where Python is supported.
It has been specifically tested on Windows NT/2000, Redhat Linux 9.0,
Fedora Core 1 & Fedora Core 2, Mandrake Linux 9.1 etc. 

HarvestMan has some performance problems on the Windows XP platform.

You can use the script 'check_dep.py' to check dependencies.
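
The version requirements stated in this README can also be checked by
hand. The following is a minimal Python sketch of such a check, not the
contents of 'check_dep.py'. The function name 'check_python' is
hypothetical, and the sketch only compares version numbers.

```python
import sys

MINIMUM = (2, 2)       # minimal requirement stated above
TIDY_MINIMUM = (2, 3)  # HTML tidying needs Python 2.3 or higher


def check_python(version=None):
    """Return (can_run, can_tidy) for a (major, minor) version tuple.

    With no argument, the running interpreter's version is checked.
    """
    if version is None:
        version = tuple(sys.version_info[:2])
    return version >= MINIMUM, version >= TIDY_MINIMUM


can_run, can_tidy = check_python()
print("meets minimum version:", can_run)
print("meets tidy version:", can_tidy)
```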

HTML Tidy - Configuration
=========================

From version 1.3.9 onwards, HarvestMan can clean web pages before
parsing them. This avoids parse errors and so helps to download more
web pages. HarvestMan uses uTidyLib, the Python wrapper for HTML Tidy.

This version of HarvestMan comes with the latest version of uTidy in the
package. The tidy source code is located in the 'tidy' sub-directory
of the 'HarvestMan' directory, so HarvestMan works transparently
with tidy.

The tidy option can be enabled or disabled by using the config variable
'control.tidyhtml'. Tidy has been disabled by default in this version.

Note that using tidy requires Python 2.3 or higher.

More Documentation
==================

Read the HarvestMan documentation in the 'doc' sub-directory for
more information. More information is also available on the project
web page.

Changes & Fix History
=====================
	
See the file Changes.txt.

Change Log for this Version
===========================

See the file ChangeLog.txt.

Website
=======

http://harvestman.freezope.org .


=======================================================================
Bangalore,
Dec 16 2004.




    

    
    

