DocSearch,

by Bharathram Hariharan, 800886307 and Midhun Mathew Sunny, 800886599

Introduction

This project is a demonstration of creating an internal search engine solely based on cloud computing tools and techniques. We have used Spark for big data processing. This project is intended for generic data and data definition is defined by constants and hence this project could be reused for any type of data. This Article gives us an insight about how the Amazon search algorithm could be.

Click for the Project Code

Objective

Perform data/item retrieval for e-commerce data set from their meta data. For this project, the data set is of books. For more information on how the meta data for the books should be chosen, refer this link

Data-set

We are using book's meta data for this project. We have crawled 127926 book’s data from Google books API. Size of dataset is approximately 200MB.

Each record has meta data for a book in json format and following are the entities.

Sample format for a book record

{"uKQ0CgAAQBAJ": {"imageLinks": {"smallThumbnail": "http://books.google.com/books/content?id=uKQ0CgAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api", "thumbnail": "http://books.google.com/books/content?id=uKQ0CgAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api"}, "categories": ["Fiction"], "description": "Meredith has never considered herself submissive even though her greatest fantasy is being pleasured against her will. When Mark orders her to her knees the first time, she can\u2019t get there fast enough\u2014and then hates herself afterward for losing control. A", "publisher": "Ellora's Cave Publishing Inc", "ISBN_13": "9781419994289", "keyWords": ["Meredith", "Mark"], "infoLink": "http://books.google.com/books?id=uKQ0CgAAQBAJ&dq=go&as_pt=BOOKS&hl=&source=gbs_api", "authors": ["L.E. Chamberlin"], "ISBN_10": "141999428X", "maturityRating": "NOT_MATURE", "title": "The Rewards of Letting Go"}}

Additionally, view count for each record is provided separately.

Sample view data format: ('NLngYyWFl_YC', 1292215)

Motivation

Most of the websites which we use (For example, Snapdeal) are not giving accurate results as user wants. Hence, we felt this would be an interesting and useful project to work upon.

Algorithm/Framework

External tools used

Deliverables

What we will definitely accomplish:

We would definitely complete the basic search engine part which makes use of indexing and query retrieval based on relevance using Map Reduce paradigm. This is completed.


What we are likely to accomplish:

We would like to improve the search accuracy after seeing results from step 1 as our next step. Used view ranking for this purpose and had achieved this. This is completed.


What we would ideally like to accomplish:

We would plan on adding suggestions and autocorrect features, if time permits, or keep those as future steps. This step is time consuming can be taken be taken as future work. This is not completed.

Work division

Midhun

Data crawling, Data cleaning, KeyWord generation


Bharath

Indexing, Querying

Project flow

Data Extraction

Google books api was used to get the book details. For making query in google books api, search terms are to be provided. Search terms we took was the category names. This category name was crawled from goodreads.com

InitialDataExtraction/category_list_backup.py has a list of category name which we got from goodreads.com. A total of 802 category names were retrieved.

Each query from google books api can yield up to 40 records. Typically, for a search term, total number of available records would in the range of 1000 - 3000.

Challenges in data extraction phase

Keywords generation

KeyWord sub project was to retrieve keywords from description of the books.

Tried two approaches

Insights from two methods

Keywords from NER's found to be unreliable and Proper Nouns gave better results. Issues with NER's were, many unwanted words were generated and Improper entity definition for the terms and hence unreliable. However, Proper Nouns were not giving False Positives as in the former case

Json data manipulation was performed using gson library

Input to this method was the data retrieved from google api.

Output is an added entity called keyWords in book record.


Indexing

This method generates an inverted index for each book record and are stored in index_rdd map.

Input for this method is referenced by id_doc_rdd_raw and its data format is as mentioned above.

index_rdd map has indexed term mapped to corresponding document id along with the zone/entity name

Following is an example.

('aceline', (('K3NLoAEACAAJ', 'keyWords'), ('7DiNBwAAQBAJ', 'title')))

'aceline' is the indexed term, K3NLoAEACAAJ is the doc_id, keyWords was the zone and likewise.

Brief walk through about Map Reduce in this phase


Querying

Each entity needs to have different weights. If 2 documents have the search term present in title and keyWords respectively, we have to give more priority to the document whose title matched. Likewise, more visited document has to be given higher priority.

Input for this method is the output from Index method and views_rdd.

Following is a weight map which was used in our model.

MODEL_WEIGHTS = {ISBN_10: .3, ISBN_13: .3, AUTHORS: .18, PUBLISHER: .15, TITLE: .15, CATEGORIES: .12, KEYWORDS: .1, VIEWS: .5}

Above weights are formulated after lot of query search experiments and intuitions.

View counts are generated for only select few documents.

Output of this phase is a list of tuple of document_id and score in sorted order.


Following table shows the accuracy improvement in the search while adding view count as a parameter.

Search term for this example was "thomas cormen"

Document Title With View Rank Without View Rank
Introduction to Algorithms 1 8
Introduction to Algorithms 2 18
Introduction to Algorithms 3 12
Introduction to Algorithms 4 14
Introduction to Algorithms 5 2
Introduction to Algorithms 6 5
United States Computer Specialist Introduction 7 23
Algorithms Unlocked 8 6
Faculty by University Or College in New Hampshire Faculty by University Or College in New Hampshire: Dartmouth College Faculty, University of New Hamp 9 24
Introduzione Agli Algoritmi 10 3
Introduction to Algorithms-Instructor?s Manual 11 11
Concentrator Switches for Routing Messages in Parallel Computers 12 7
Introduction to Algorithms and Java CD-ROM 13 16
算法导论 14 19
Computer Science Teachers 15 29
Nine Algorithms That Changed the Future 16 32
Coding Puzzles, 2nd Edition 17 25
Introduction to Algorithms and Java CD-ROM 18 9
Introduction to Algorithms 19 10
Vector Layout in Virtual-memory Systems for Data-parallel Computing 20 20
Introduction to Algorithms 21 1
Introduction to Algorithms and Java CD-ROM 22 15
Introduction to Algorithms (Instructor's Manual) 23 13
Introduction to Algorithms 24 21
Studyguide for Introduction to Algorithms by Thomas H. Cormen, Isbn 9780262033848 25 17
Introduction To Algorithms 26 4
e-Study Guide for Introduction to Algorithms, textbook by Thomas H. Cormen 27 22
Coding Puzzles 28 27
North American Scientist Introduction 29 30
North American Scientist Introduction 30 31
American Scientist Introduction 31 26
American Computer Specialist Introduction 32 28

Similar titles are there in the table, since, a book could have multiple editions, and book types(hardcover, paperback, e-book).

Brief walk through about filter phase after Querying

Results

Mini search engine was implemented successfully.

Note that we had a total of 120,000 books in our sample data-set and hence a direct comparison between an established search engine like google wont give exact accuracy rate. However, for the books which have, following are findings.

Following table shows the comparison of our search engines rank with the books which came 1st in google

Term Document Id Rank in this system
clrs i-bUBQAAQBAJ 1
deception point BjFZ0UQOUCgC 1
bleak house OeD4XMKlxUIC 1