Advanced Web Site Search with SQL

I was recently working on a custom content management system for a client and I was adding a search box to the site to allow users to search the client’s site. Because all the content is contained within a database, I can easily search through the pages table to grab any pages that have the search terms contained within their title, content, or keywords meta tag. I wanted to make it a little more robust than simply doing a select statement and grabbing any pages with the search term and I wanted to also rank the results based on a relevance factor determined by where the search words were found.

I was recently putting together a custom search form for a client, and wanted to improve on it from the basic “SELECT title FROM page WHERE title LIKE ‘%search-query%’” method of searching a site. I wanted to include some more relevance to the search so pages that have the search words in their title were a little more relevant, as well as pages with the search term in the filename, and keywords meta tag as well as the body.

I hate most search forms on web sites simply because they’re so horrible. Many of the sites I’ve worked on with pre-existing search components have just done a simple select statement that grabs any page from the database that has the search word in the body of the content. They don’t take into account the page title, keywords, filename, along with the content and it can often make for less-than-intelligent search results. This has even led to many site owners installing a google search box on their site, however, for the kind of projects I take on, I like to custom-build as much as I can for design flexibility and future maintainability.

Search is Important

Search is also very overlooked by site owners. They’ll setup something like google, or atomz, or htDig, and not really test the relevance of the search to see if the search results as relevant as they should be. Even google’s custom site search isn’t the most relevant when you’re site isn’t optimized accordingly, and when you’re clients are in control of the content you can’t ensure that the site is optimized for google – making even google site search less than favorable.

The search function in an e-commerce site is even more important than a content site as you’ll find that the search box is the primary method for users to find product on your site. That’s why Amazon’s search is so prominent on their web site.

Relevancy-based Search

Just as google considers so many factors in their search results, you should consider more than just the content of the page. In the case of my custom CMS, I’m allowing users to add keywords, descriptions, titles and other standard META tags and I’ll base my search results off each of those fields as well as the content of the page. Some of the search components I’ve seen have used multiple database queries to grab all the pages where the search query matched the title, another query to grab the pages that matched the keywords, another for the description, and finally one more for the content. Then moving those results into a bunch of arrays and applying some more logic to determine how relevant those results are and send them to the search results page. There isn’t anything particularly wrong with this, but I like to do things in the least number of database connections as possible, and to keep it as simple as possible from the code side of things.

Searching Based on Multiple Fields

Here’s the sql statement I’ve used – I’ll explain it below:

SELECT title, filename, sum(relevance)
FROM (
SELECT title, filename, 10 AS relevance FROM page WHERE title like ‘%about%’
UNION
SELECT title, filename, 7 AS relevance FROM page WHERE filename like ‘%about%’
UNION
SELECT title, filename, 5 AS relevance FROM page WHERE keywords like ‘%about%’
UNION
SELECT title, filename, 2 AS relevance FROM page WHERE description like ‘%about%’
) results
GROUP BY title, filename
ORDER BY relevance desc;

The sql is a little more complex than a basic select statement but if you’re familiar with unions then it’s pretty simple. All I’ve done is created a number of select statements that grab the pages where the field matches the search term. In this case, the search term is "about". I’ve searched through the title, filename, keywords, and description fields and left the content field off the search results. I didn’t want to include the content because I decided if the search isn’t found in the title, filename, keywords or description then it really isn’t very relevant for the search and even if it does have it in the content, it’s most likely a minor part of the content and not too helpful for the user.

I’ve manually included a relevance field in each of these select statements assigning a "point-value" to each page in those results in order to weight certain fields. These relevance points are totally up to you, and for the purposes of this example, I’ve given the results 10 points if the search word is in the title, 7 if it’s in the filename, 5 if it’s in the keywords, and 2 if it’s in the description.

Next, I union all of those search results together to create a large list of results. However, if a page appears in more than one of those select statements then we need to group them, so using a subquery, I’ve summed the relevance fields and then grouped by the filename and title then ordering the results descending by relevance.

Search Results

title filename sum(relevance)
About the Chamber about-chamber 17
About our Programs programs 10
Policies policies 2
Opportunities opportunities 2

Here the about page came up number one because it had the word about in the filename as well as in the title, and the programs page came in second since the word about was in the title, but not in the filename. There were some other pages that had the word in the description so they were only given 2 points. Obviously a search for the word about isn’t going to be very helpful, but to outline this example I think it worked well. Now I just need to simply output these results from mysql with php and the search results are done. I grabbed the title to output on the page and then filename to link to from the search results page.

Relevance Percentage

Rather than outputting the actual relevance score sometimes you’ll see a percentage of how relevant the page is. In our case that’s easy to do as well. Based on our formula of 10 points for the title, 7 for the filename, 5 for keywords, and 2 for description it’s possible to get a total of 24 points. To get the pages percentage relevance we just divide the relevance score from our query by 24 and then multiple by 100. For example, 17/24*100=70% so our about page was 70% relevant.

Tweaking for Better Search Results

You will need to tweak this formula as you test your search functionality, and adjust the weighting for each field to get the best results. You might also want to add another select statement that grabs all pages where the title is exactly the same as your search – in that case you might want to give them 100 points to ensure that that page is weighted much higher than the others. This will help you tweak your search results for the best user experience.

Full Text Search

I’ve provided the examples using the LIKE clause simply because most users will already be familiar with it, but also because the fields that I’m searching are relatively small – I wasn’t searching the entire contents field of the pages on the site. If I wanted to add the content field, I’d probably use full-text searching using the match() function, you can read up on that here.

2 Comments on “Advanced Web Site Search with SQL”

  1. Thomas Says:

    been searching for this forever.. you sql statement saved my life.. I love it,

    got 1 request… i need to join a category lookup table.

    how can i adjust the sql statment to do so?

  2. Stephen Says:

    A very interesting, simple and effective way to get search results based on relevance. I’m glad I found this blog.

Comment