This past semester, for a Computer Architecture class I was taking, we were assigned a project to implement a distributed search engine using Nutch. The following notes describe how to install and configure Nutch as a distributed search engine on Windows XP using Cygwin (a Linux-like environment for Windows) and Apache Tomcat. Note that the master computer was running Windows XP while the slave machines were running Mac OS X, showing that the search engine works across multiple platforms.
What do you mean by “Distributed”?
Nutch works great as a single-node architecture, but fetching and indexing take a long time to complete. Do you really want your master node (which serves the front-end GUI search engine to clients) to spend time fetching and indexing possibly millions of pages? The answer is no.
Ideally, you would have slave nodes doing the fetching and indexing for you, while the master node simply handles the queries (and maybe even multiple machines handling queries, but this tutorial does not focus on that). Our architecture had one master node and two slave nodes (named master, slave1, and slave2 in the configuration files below).
Prerequisites
Java
Nutch requires Java. For this project, we downloaded and installed Java SE JDK 1.6 Update 10 for PC and used the existing JDK 1.5.0 native installation on the Macs. For each machine, we had to set the path to Java accordingly in the operating system being used. On the master running Cygwin, this path was:
export JAVA_HOME='/cygdrive/c/program files/java/jdk1.6.0_10'
For Mac we used the following path:
export JAVA_HOME='/System/Library/Frameworks/JavaVM.framework/Versions/1.5.0/Home'
Nutch
We used Nutch 0.9 and placed the files in the root of our file system. On the master, this was:
C:\nutch-0.9
Within Cygwin, this directory is:
/cygdrive/c/nutch-0.9/
For the Macs we placed the Nutch folder on the desktop with the path:
/Users/your_user_name/Desktop/nutch-0.9/
Apache Tomcat
We used Apache Tomcat 6.0.18 (Core) for the master. Slave machines do not require Tomcat, but it may be installed on them in case the master becomes unavailable (one of the slaves can then be configured to be the master).
Cygwin
Lastly, we used Cygwin 1.5.25-15 on the master, installed with the default packages. Since the slave machines were running Mac OS X, we did not need a Linux-like environment for those machines. Note that the JAVA_HOME environment variable (shown above) must be set every time Cygwin is started.
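To avoid retyping the export on every start, one option is to append it to a shell startup file. The following is only a sketch: it assumes your shell reads ~/.bashrc on startup, and the helper name persist_java_home is made up for illustration.

```shell
# Hypothetical helper: append a JAVA_HOME export to a startup file, once.
# $1 = startup file (for example ~/.bashrc), $2 = path to the JDK
persist_java_home() {
    profile=$1
    jdk=$2
    # Do nothing if the file already exports JAVA_HOME.
    if [ -f "$profile" ] && grep -q '^export JAVA_HOME=' "$profile"; then
        return 0
    fi
    # Single quotes keep paths with spaces (such as "program files") intact.
    printf "export JAVA_HOME='%s'\n" "$jdk" >> "$profile"
}
```

On the master this could be called as persist_java_home ~/.bashrc '/cygdrive/c/program files/java/jdk1.6.0_10', after which new Cygwin sessions should pick the variable up.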
Modifications
Nutch
Before we can begin crawling, indexing and searching, we have to provide Nutch with a list of websites. To do this, we created a folder called “urls” in the Nutch directory, and within this folder, we created a file called “urls.txt”. This file lists the websites we used for the project, one per line. Each machine has different websites in this file since we are doing a distributed search.
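As a rough sketch, the folder and file can be created from the Nutch directory like this. The two URLs are placeholders, not the sites we actually crawled.

```shell
# Create the seed folder and list, one URL per line.
# The URLs below are placeholders; substitute the sites assigned to this machine.
mkdir -p urls
cat > urls/urls.txt <<'EOF'
http://www.example.com/
http://www.example.org/
EOF
```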
Within the /conf/ directory, there is a file called crawl-urlfilter.txt which also must be modified to include the above websites (domain names only).
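In the crawl-urlfilter.txt that ships with Nutch 0.9, accepted domains are written as regular expressions prefixed with a plus sign. Assuming that format, the sketch below prints one accept rule per domain. The helper name filter_lines and the example domains are made up for illustration.

```shell
# Hypothetical helper: print one "accept" rule per domain name argument,
# in the style of the stock rule +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
filter_lines() {
    for domain in "$@"; do
        # \\ in the printf format produces a single literal backslash.
        printf '+^http://([a-z0-9]*\\.)*%s/\n' "$domain"
    done
}

filter_lines example.com example.org
```

The printed lines would be pasted into crawl-urlfilter.txt by hand, above any catch-all rule at the end of the file that skips everything else.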
The next modification was to the /conf/nutch-site.xml file. This change is optional, but we modified the XML data to list our own agent name, description, URL, and e-mail address for our “spider.” This information shows up in the logs of the web servers we crawl.
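For reference, the sketch below writes a sample fragment using the http.agent.* property names from Nutch's default configuration. Every value is a placeholder, and the output file name nutch-site.sample.xml is made up so that nothing in /conf/ is overwritten.

```shell
# Write a sample fragment to merge by hand into conf/nutch-site.xml,
# inside its <configuration> element. All values are placeholders.
cat > nutch-site.sample.xml <<'EOF'
<property>
  <name>http.agent.name</name>
  <value>MyCourseSpider</value>
</property>
<property>
  <name>http.agent.description</name>
  <value>Class project crawler</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://www.example.com/spider.html</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>student@example.com</value>
</property>
EOF
```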
Next, we changed the /conf/hadoop-env.sh file to include our JAVA_HOME variable and to configure the paths for our log files and the file that contains a list of the slave machines. We added the following lines to this file:
export JAVA_HOME=/cygdrive/c/Progra~1/Java/jdk1.6.0_10
export HADOOP_LOG_DIR=/cygdrive/c/nutch/search/logs
export HADOOP_SLAVES=/cygdrive/c/nutch/search/conf/slaves
For Mac we added the following lines to this file:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.5.0/Home
export HADOOP_LOG_DIR=/Users/your_user_name/Desktop/search/logs
export HADOOP_SLAVES=/Users/your_user_name/Desktop/search/conf/slaves
The next step was to configure our /conf/slaves file on the master to include the slave machine names as follows:
master
slave1
slave2
We also have to create a file called search-servers.txt containing the slave machine names and port numbers to be used for “search servers”. This can optionally be added to the slave machines, as well (in case one is later used to be the master). We add this file to a new folder within the Nutch distribution folder:
PC: C:\nutch\search\servers\search-servers.txt
Mac: /Users/your_user_name/search/servers/search-servers.txt
This file contains the following (host, port):
master 1234
slave1 1234
slave2 1234
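If you have more than a couple of machines, the file can be generated rather than typed. This is a minimal sketch that writes to a relative search/servers folder; adjust the path, host names, and port to match your own setup.

```shell
# Generate search-servers.txt with one "host port" pair per line.
# PORT and the host list mirror the example above; change them as needed.
PORT=1234
mkdir -p search/servers
for host in master slave1 slave2; do
    echo "$host $PORT"
done > search/servers/search-servers.txt
```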
Note that “master” is listed in both of these files because the master also provides a crawl database to search. This isn’t necessary, but we used the master to both search and crawl. We also modified our “hosts” file so that these machine names map to specific IP addresses. If your machines are on the same network, you can instead use their existing machine names and leave the “hosts” file alone.
On Windows, this file is located in:
C:\Windows\System32\Drivers\etc\hosts
On Mac the hosts file is located in:
/etc/hosts
The following is an example of the master’s host file:
127.0.0.1 localhost
127.0.0.1 master
128.235.71.13 slave1
128.235.65.61 slave2
Again, this step is optional, but makes configuration easier. Doing it this way, we only have to change the host file if any of the machines’ IP addresses change. It is also easier to use a name rather than an IP address to refer to the machine.
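A quick way to catch a forgotten entry is to check each machine name against the hosts file. The helper below is only a sketch, and its name missing_hosts is made up. It strips carriage returns because the Windows hosts file may use Windows line endings when read from Cygwin.

```shell
# Hypothetical helper: print every name that has no entry in a hosts-format file.
# $1 = hosts file, remaining arguments = machine names to look for
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        if ! awk -v n="$name" '
            {
                gsub(/\r/, "")      # tolerate Windows line endings
                sub(/#.*/, "")      # drop comments
                # Column 1 is the IP address; names start at column 2.
                for (i = 2; i <= NF; i++) if ($i == n) found = 1
            }
            END { exit !found }' "$hosts_file"; then
            echo "$name"
        fi
    done
}
```

For example, missing_hosts /etc/hosts master slave1 slave2 prints nothing when all three names are present.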
This concludes the configuration of Nutch as a command-line application. Next, we have to configure the Nutch search engine within Apache Tomcat.
Tomcat and Nutch Search Engine
Note that this section is for the “master” machine only. Slaves do not necessarily need to run Tomcat in this setup. In our Nutch distribution folder, we have a file called “nutch-0.9.war”. This is a deployment file which we can load into Apache Tomcat to set up as a web app. Here’s how we do it:
- Start Tomcat by running startup.sh from the bin folder of your Tomcat directory.
- In your web browser, go to: http://localhost:8080
- On the left-hand side of the Apache Tomcat website, click on “Manager”
- Under “Deploy”, “WAR File to Deploy”, click “Browse” and browse to the Nutch distribution folder. Click on “nutch-0.9.war”.
- Upon successful deployment, you will now have a “nutch-0.9” web app listed in the Manager page, as well as in your Tomcat /webapps/ directory.
- Within the “Manager” section of Tomcat, click “Start” under “nutch-0.9”. This starts the web app.
- Browse to the search engine website: http://localhost:8080/nutch-0.9/
Now we must make some configurations to the web app.
- Browse to: {tomcat directory}/webapps/nutch-0.9/WEB-INF/classes
- Open nutch-site.xml for editing.
- Set the searcher.dir property. The master must point to the list of “search servers” we created earlier, which resides at: C:\nutch\search\servers\search-servers.txt.
- We only need to point to the folder that contains this file, so the property is:
<property>
<name>searcher.dir</name>
<value>C:\nutch\search\servers</value>
</property>
Running Nutch
Now that everything is configured, we can perform a crawl on the slave machines (and optionally the master), start the search servers on the slave machines, and start Tomcat on the master. Here are step-by-step instructions on how to do this:
Crawling
This may be done on the slaves and/or the master.
- On the machine doing the crawl, start up Cygwin (Windows) or open a Terminal (Mac).
- Set the JAVA_HOME environment variable.
- Browse to your Nutch directory.
- If necessary, edit your /conf/crawl-urlfilter.txt file and your urls file to include the websites that you wish to crawl.
- Type in the command:
bin/nutch crawl urls -dir <path_to_store_crawl_db> -depth <depth_number> -topN <top_pages_number>
Here, urls is the folder containing your list of URLs to crawl, -dir is the path where the crawl database is saved, -depth is how many links deep to follow from the root pages, and -topN is the maximum number of pages to retrieve at each level.
Example:
bin/nutch crawl urls -dir crawl -depth 10 -topN 50
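Because a crawl can run for a long time, it may help to check the setup first. The following is a sketch of such a check, meant to be run from the Nutch directory; the function name preflight is made up, and it only tests the items configured earlier in these notes.

```shell
# Hypothetical pre-crawl check; returns non-zero if something is missing.
preflight() {
    status=0
    if [ -z "${JAVA_HOME:-}" ] || [ ! -d "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is unset or is not a directory" >&2
        status=1
    fi
    # -s is true only if the file exists and is not empty.
    if [ ! -s urls/urls.txt ]; then
        echo "urls/urls.txt is missing or empty" >&2
        status=1
    fi
    if [ ! -f conf/crawl-urlfilter.txt ]; then
        echo "conf/crawl-urlfilter.txt not found" >&2
        status=1
    fi
    return "$status"
}
```

It can then guard the crawl, for example: preflight && bin/nutch crawl urls -dir crawl -depth 10 -topN 50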
Searching
To begin the search service, we must first start the master, then the slaves:
Starting the Master
- On the master node, start up Cygwin.
- Set the JAVA_HOME environment variable.
- Start up Tomcat by running bin/startup.sh from your Tomcat directory.
- Go to http://localhost:8080 to ensure the Nutch web app is started.
- If the master is also supplying a crawl database, browse to your Nutch directory and start the search server by typing:
bin/nutch server <port> <crawl_folder>
Here, port is the port number you wish to use (it must match this machine's entry in search-servers.txt), and crawl_folder is the folder holding the crawl database.
Starting the Slaves
- Open a terminal window.
- Set the JAVA_HOME environment variable.
- Browse to your Nutch directory.
- Start the search server by typing:
bin/nutch server <port> <crawl_folder>
To test the search, go to: http://master:8080/nutch-0.9/
Search results should now be drawn from every machine running a search server. To test whether or not this is working properly, look at the Tomcat log file at {tomcat directory}/logs/catalina.out. You should see a message stating “STATS: x servers, n segments” where x is the number of search servers you are running and n is the number of segments being searched.
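Since catalina.out can grow large, it may be easier to pull out just the most recent STATS line. This is a small sketch; the helper name last_stats is made up, and it assumes only that the line contains the text “STATS:” as described above.

```shell
# Hypothetical helper: print the most recent STATS line from a Tomcat log.
# $1 = path to catalina.out
last_stats() {
    grep 'STATS:' "$1" | tail -n 1
}
```

Call it with the path to your own log, for example last_stats logs/catalina.out from the Tomcat directory. If nothing is printed, no STATS line has been logged yet.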
Statistics
To retrieve some basic statistics such as the number of pages crawled and indexed, type in the command:
bin/nutch readdb <crawl_folder>/crawldb -stats
A specific example of this, using the crawl folder from the crawl example above, is:
bin/nutch readdb crawl/crawldb -stats
Questions, problems, or comments? Ask away!
