Java-Web:Search Servlet

From Juneday education

This chapter explains a simple search function implemented as a Java Servlet. The example also shows that certain functionality on a web page must be implemented in a programming language (for instance Java with Servlets), because the content has to be created on the fly.

Why a Servlet

Imagine that you want to add a search box on your web site. The user should enter some text to search for, and the result should be presented on a new page showing a list of links to pages on your site mentioning the search word.

Obviously, the result page will be different depending on what word(s) the user searches for, so we can't have a static web page with the result links. The results page must be created on-the-fly by some application. The application must read the text of each page and create the results page dynamically with links to all matching pages.

We could implement this as a Servlet, and that is what we have done. We need an HTML page with a form for the search box, which lets the user's browser send the search word to our Servlet. The Servlet should do a few things:

  • Read the search word (an HTTP GET parameter called e.g. "search_word")
  • Look for the word in all HTML pages in webroot
    • Make HTML with links to each matching page
  • Write the result back as HTML to the client

Setup

Source code

The source code for this example can be found here:

https://github.com/progund/java-web/tree/master/search-servlet

The directory looks like this:

.
├── clean.sh
├── compile_servlet.sh
├── run_winstone.sh
├── winstone.jar
└── www
    ├── ekonomi.html
    ├── search.html
    ├── sport.html
    ├── weather.html
    └── WEB-INF
        ├── classes
        │   └── se
        │       └── itu
        │           └── web
        │               ├── SearchServlet.java
        │               └── Utf8Filter.java
        └── web.xml

6 directories, 11 files

The Utf8Filter class exists only to force Winstone to serve HTML files as UTF-8 encoded text and has nothing to do with the search itself. The important file is the se.itu.web.SearchServlet class in www/WEB-INF/classes/se/itu/web/SearchServlet.java .

If you want to learn about filters, see the Oracle article on Servlet Filters (extra reading). As noted, the filter is not needed to understand this example.

web.xml

We define one servlet, named search-servlet and implemented by the class se.itu.web.SearchServlet , and map it to the URL pattern /search . The same file also registers the Utf8Filter for all *.html files.

<?xml version="1.0" encoding="utf-8"?>
<web-app version="2.5" xmlns="http://java.sun.com/xml/ns/javaee"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://java.sun.com/xml/ns/javaee
                             http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd">
  <servlet>
    <servlet-name>search-servlet</servlet-name>
    <servlet-class>se.itu.web.SearchServlet</servlet-class>
  </servlet>
  <servlet-mapping>
    <servlet-name>search-servlet</servlet-name>
    <url-pattern>/search</url-pattern>
  </servlet-mapping>
  
  <filter>
    <filter-name>force-encoding-filter</filter-name>
    <filter-class>se.itu.web.Utf8Filter</filter-class>
    <init-param>
      <param-name>Content-Type</param-name>
      <param-value>text/html;charset=utf-8</param-value>
    </init-param>
  </filter>

  <filter-mapping>
    <filter-name>force-encoding-filter</filter-name>
    <url-pattern>*.html</url-pattern>
  </filter-mapping> 
</web-app>

search.html

Compile and run the web application:

$ chmod u+x *.sh
$ ./compile_servlet.sh && ./run_winstone.sh

[Figure: HTML page with a search form]

Next, visit http://localhost:8080/search.html . You should be presented with a form for searching the "site". Search for "henrik" and see which pages contain that word. Also try "storm", "aktie" and a word of your own choosing. Try searching for nothing (leave the search box empty), and try removing the search_word GET parameter from the results page URL.

The HTML for the search page is:

<!DOCTYPE html>
<html>
  <head><title>Search this site</title></head>
  <body>
    <form action="/search" method="GET">
      Sök: <input type="text" size="20" name="search_word" />
      <input type="Submit" value="Sök!" />
    </form>
  </body>
</html>

The form element is what does the magic here. It makes your browser show a text field and a button. When you click the button (or press Enter while in the text field), the browser sends a GET request to the form's action, /search, on the same site the page came from, with the value of the text field appended as the search_word parameter. You can see this in the browser's address bar, which changes according to the form specification.

The next page you end up on will have the following URL if you searched for "henrik":

http://localhost:8080/search?search_word=henrik
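
To make the shape of that URL concrete, here is a minimal sketch of how such a query string can be built in plain Java. The class and method names (SearchUrlSketch, searchUrl) are made up for illustration and are not part of the example project. We assume the form is submitted as UTF-8 and that Java 10 or later is used (for the URLEncoder.encode overload that takes a Charset).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Sketch: builds the kind of URL a browser requests when the search form is submitted. */
final class SearchUrlSketch {

  // Hypothetical helper, not part of the example project.
  static String searchUrl(String base, String searchWord) {
    // URLEncoder applies HTML form encoding: a space becomes '+',
    // and non-ASCII characters become percent-encoded UTF-8 bytes.
    return base + "/search?search_word="
      + URLEncoder.encode(searchWord, StandardCharsets.UTF_8);
  }
}
```

With this sketch, searchUrl("http://localhost:8080", "henrik") gives the URL shown above, and a search for "Göteborgs kust" would end in search_word=G%C3%B6teborgs+kust .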

The Servlet

So, the servlet should start off by reading the GET parameter "search_word" into a variable. Next, the servlet should create HTML with links to any pages matching the search word.

The links are created in a separate method, getSearchResultsHtml(). We'll return to that method below, and focus on doGet() here.

doGet()

  @Override
  public void doGet(HttpServletRequest request, HttpServletResponse response)
    throws ServletException, IOException {
    request.setCharacterEncoding(UTF_8.name());
    response.setContentType("text/html;charset="+UTF_8.name());
    PrintWriter out =
      new PrintWriter(new OutputStreamWriter(response.getOutputStream(),
                                             UTF_8), true);
    out.println("<!DOCTYPE html>");
    out.println("<html lang=\"en\">");
    out.println("<head><title>SearchResults</title></head>");
    out.println("<body>");
    out.println("<h1>Results</h1>");

    String searchWord = request.getParameter("search_word");    
    String results = getSearchResultsHtml(searchWord);
    if (results.length() == 0) {
      results = "No matches for " + searchWord;
    }
    
    out.println(results);
    out.println("</body>");
    out.println("</html>");
  }

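The key steps are reading the GET parameter, getting the result links, and printing the result. If the request has no search_word parameter, getParameter() returns null, which getSearchResultsHtml() handles. The result (or a message about the absence of a result) is surrounded by the HTML needed to make a complete page.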
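
Note that doGet() writes the search word back into the page as-is in the "No matches for" message, so a search word containing HTML would be interpreted by the browser. A common precaution is to escape the text first. The following is a minimal sketch of such a helper; the names HtmlEscapeSketch and escapeHtml are our own invention and do not exist in the example project.

```java
/** Sketch: escapes text before it is embedded in generated HTML. */
final class HtmlEscapeSketch {

  // Hypothetical helper; the servlet in this chapter prints searchWord unescaped.
  static String escapeHtml(String text) {
    StringBuilder sb = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      switch (c) {
        case '&':  sb.append("&amp;");  break;
        case '<':  sb.append("&lt;");   break;
        case '>':  sb.append("&gt;");   break;
        case '"':  sb.append("&quot;"); break;
        case '\'': sb.append("&#39;");  break;
        default:   sb.append(c);        // all other characters pass through
      }
    }
    return sb.toString();
  }
}
```

Assuming such a helper were added to the servlet, the message could be built as "No matches for " + escapeHtml(searchWord) instead.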

getSearchResultsHtml()

This method is passed the search phrase as an argument. If the argument is null or empty, the method immediately returns a message about the absence of a search phrase.

Next, it creates a Path object representing the webroot directory and an empty StringBuilder for the resulting HTML.

The strategy (or "algorithm" if you want to sound like a title for an informatics paper) for finding out what pages to create links for in the result is to create a DirectoryStream of all the Paths of the HTML files directly in webroot. For each such HTML file, we create a stream of all the lines of text in the file and check if any line contains the search phrase, ignoring case. Since the check is done line by line, a phrase that is split across two lines will not be found. If a match is found, we create a link with the file name as the text and the file as the target, and add it to our results StringBuilder.

  private String getSearchResultsHtml(String searchWord) {

    if (searchWord == null || searchWord.equals("")) {
      return "No search word given";
    }

    Path root = Paths.get(webroot);
    StringBuilder results = new StringBuilder();

    try (DirectoryStream<Path> dir =
         Files.newDirectoryStream(root, "*.html")) {
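      for (Path file : dir){
        boolean found;
        // Files.lines keeps the file open, so close the stream when done
        try (java.util.stream.Stream<String> lines = Files.lines(file)) {
          found = lines.anyMatch(line -> line.toLowerCase()
                                 .contains(searchWord.toLowerCase()));
        }
        if (found) {
          results.append("<a href=\"");
          results.append(file.getFileName());
          results.append("\">");
          results.append(file.getFileName());
          results.append("</a><br>\n");
        }
      }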
    } catch (Exception e) {
      System.err.println("error: " + e.getMessage());
    }
    
    return results.toString();
  }

Here's Oracle's tutorial on listing files in a directory: https://docs.oracle.com/javase/tutorial/essential/io/dirs.html#listdir
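
If you want to experiment with the matching logic without starting a web server, the same idea can be tried in a plain Java class. The sketch below is a variation, not the code from the project: the names PageSearchSketch and matchingPages are made up, it returns file names instead of HTML, and it sorts the names because the order in which a DirectoryStream returns its entries is unspecified.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

/** Sketch: finds the HTML files in one directory that mention a word. */
final class PageSearchSketch {

  // Hypothetical helper mirroring the loop in getSearchResultsHtml().
  static List<String> matchingPages(Path root, String searchWord)
      throws IOException {
    String needle = searchWord.toLowerCase(Locale.ROOT);
    List<String> matches = new ArrayList<>();
    // The glob "*.html" only matches files directly in root
    try (DirectoryStream<Path> dir = Files.newDirectoryStream(root, "*.html")) {
      for (Path file : dir) {
        // Files.lines keeps the file open until the stream is closed
        try (Stream<String> lines = Files.lines(file)) {
          if (lines.anyMatch(l -> l.toLowerCase(Locale.ROOT).contains(needle))) {
            matches.add(file.getFileName().toString());
          }
        }
      }
    }
    Collections.sort(matches); // make the result order predictable
    return matches;
  }
}
```

Calling matchingPages(Paths.get("www"), "henrik") from the project directory should list the pages that mention the word, assuming the files are UTF-8 encoded (Files.lines reads UTF-8 by default).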

Further development

If you want a challenge for improving this naïve search engine, here are some improvement suggestions:

  • If the search phrase consists of several words, loop over the words and search for any of them (rather than treating the search phrase as a fixed phrase)
  • Also search all subdirectories of webroot (except WEB-INF, which is private). Hint: see the Java tutorial "Walking the File Tree"
  • Add support for logical operators, e.g. "henrik AND record", "henrik OR rikard"
  • Add support for exact phrases (putting the search phrase in quotes should make it a phrase and not several words)
  • Only search the text of the pages, and not also the HTML tags - try searching for "body" or "form" and you'll see what we mean
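
As a possible starting point for the first two suggestions, here is a rough sketch. It is one way to approach them, not a reference solution, and the names SiteWalkSketch, htmlFilesBelow and containsAnyWord are our own. It uses Files.walkFileTree with a SimpleFileVisitor that skips any directory called WEB-INF, and a simple check for whether a line contains at least one of several whitespace-separated words.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Sketch: helpers for searching subdirectories and several words. */
final class SiteWalkSketch {

  /** Collects all .html files below root, skipping any WEB-INF directory. */
  static List<Path> htmlFilesBelow(Path root) throws IOException {
    List<Path> found = new ArrayList<>();
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult preVisitDirectory(Path dir,
                                               BasicFileAttributes attrs) {
        Path name = dir.getFileName();
        if (name != null && name.toString().equals("WEB-INF")) {
          return FileVisitResult.SKIP_SUBTREE; // private, do not descend
        }
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
        if (file.getFileName().toString().endsWith(".html")) {
          found.add(file);
        }
        return FileVisitResult.CONTINUE;
      }
    });
    Collections.sort(found); // visiting order is not guaranteed
    return found;
  }

  /** True if the line contains at least one of the words in the phrase. */
  static boolean containsAnyWord(String line, String phrase) {
    String haystack = line.toLowerCase(Locale.ROOT);
    for (String word : phrase.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
      if (!word.isEmpty() && haystack.contains(word)) {
        return true;
      }
    }
    return false;
  }
}
```

The remaining suggestions (logical operators, exact phrases and ignoring HTML tags) are left open; they would need some parsing of the search phrase and of the pages themselves.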

Appendix

Servlet listing

package se.itu.web;

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import javax.servlet.*;
import javax.servlet.http.*;
import static java.nio.charset.StandardCharsets.UTF_8;
import java.io.*; // IOException, BufferedReader, File...
import java.nio.file.*; //Files, Paths;


public class SearchServlet extends HttpServlet {

  private static String webroot;

  public void init() {
    webroot = getServletContext().getRealPath("");
  }
  
  @Override
  public void doGet(HttpServletRequest request, HttpServletResponse response)
    throws ServletException, IOException {
    request.setCharacterEncoding(UTF_8.name());
    response.setContentType("text/html;charset="+UTF_8.name());
    PrintWriter out =
      new PrintWriter(new OutputStreamWriter(response.getOutputStream(),
                                             UTF_8), true);
    out.println("<!DOCTYPE html>");
    out.println("<html lang=\"en\">");
    out.println("<head><title>SearchResults</title></head>");
    out.println("<body>");
    out.println("<h1>Results</h1>");

    String searchWord = request.getParameter("search_word");    
    String results = getSearchResultsHtml(searchWord);
    if (results.length() == 0) {
      results = "No matches for " + searchWord;
    }
    
    out.println(results);
    out.println("</body>");
    out.println("</html>");
  }

  private String getSearchResultsHtml(String searchWord) {

    if (searchWord == null || searchWord.equals("")) {
      return "No search word given";
    }

    Path root = Paths.get(webroot);
    StringBuilder results = new StringBuilder();

    try (DirectoryStream<Path> dir =
         Files.newDirectoryStream(root, "*.html")) {
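      for (Path file : dir){
        boolean found;
        // Files.lines keeps the file open, so close the stream when done
        try (java.util.stream.Stream<String> lines = Files.lines(file)) {
          found = lines.anyMatch(line -> line.toLowerCase()
                                 .contains(searchWord.toLowerCase()));
        }
        if (found) {
          results.append("<a href=\"");
          results.append(file.getFileName());
          results.append("\">");
          results.append(file.getFileName());
          results.append("</a><br>\n");
        }
      }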
    } catch (Exception e) {
      System.err.println("error: " + e.getMessage());
    }
    
    return results.toString();
  }
}

web.xml

<?xml version="1.0" encoding="utf-8"?>
<web-app version="2.5" xmlns="http://java.sun.com/xml/ns/javaee"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://java.sun.com/xml/ns/javaee
                             http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd">
  <servlet>
    <servlet-name>search-servlet</servlet-name>
    <servlet-class>se.itu.web.SearchServlet</servlet-class>
  </servlet>
  <servlet-mapping>
    <servlet-name>search-servlet</servlet-name>
    <url-pattern>/search</url-pattern>
  </servlet-mapping>
  
  <filter>
    <filter-name>force-encoding-filter</filter-name>
    <filter-class>se.itu.web.Utf8Filter</filter-class>
    <init-param>
      <param-name>Content-Type</param-name>
      <param-value>text/html;charset=utf-8</param-value>
    </init-param>
  </filter>

  <filter-mapping>
    <filter-name>force-encoding-filter</filter-name>
    <url-pattern>*.html</url-pattern>
  </filter-mapping> 
</web-app>

sport.html

<!DOCTYPE html>
<html>
  <body>
    Henrik vann 100 meter sprint i fredags.
  </body>
</html>

ekonomi.html

<!DOCTYPE html>
<html>
  <body>
    Henrik köpte 100 Ericsson-aktier igår.
  </body>
</html>

weather.html

<!DOCTYPE html>
<html>
  <body>
    Stormen Henrik nådde Göteborgs kust igår.
  </body>
</html>

compile_servlet.sh

#!/bin/bash

javac -cp winstone.jar www/WEB-INF/classes/se/itu/web/*.java

run_winstone.sh

#!/bin/bash

java -jar winstone.jar --webroot=www

clean.sh

#!/bin/bash

find . -name '*~' | xargs rm -f
find . -name '*.class' | xargs rm -f

Links

Further reading