DocSearch,

by Bharathram Hariharan, 800886307 and Midhun Mathew Sunny, 800886599

Introduction

This project is a demonstration of creating an internal search engine solely based on cloud computing tools and techniques. We have used Spark for big data processing. This project is intended for generic data and data definition is defined by constants and hence this project could be reused for any type of data. This Article gives us an insight about how the Amazon search algorithm could be.

Click for the Project Code

Objective

Perform data/item retrieval for e-commerce data set from their meta data. For this project, the data set is of books. For more information on how the meta data for the books should be chosen, refer this link

Data-set

We are using book's meta data for this project. We have crawled 127926 book’s data from Google books API. Size of dataset is approximately 200MB.

Each record has meta data for a book in json format and following are the entities.

Title Title of the book
Authors List of Authors for the book
Publishers Book Publisher
Description Book Description
Categories List of Categories/Sub-Categories
ISBN_10
ISBN_13
KeyWords Special terms people search for finding this book

Sample format for a book record

{"uKQ0CgAAQBAJ": {"imageLinks": {"smallThumbnail": "http://books.google.com/books/content?id=uKQ0CgAAQBAJ&printsec=frontcover&img=1&zoom=5&edge=curl&source=gbs_api", "thumbnail": "http://books.google.com/books/content?id=uKQ0CgAAQBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&source=gbs_api"}, "categories": ["Fiction"], "description": "Meredith has never considered herself submissive even though her greatest fantasy is being pleasured against her will. When Mark orders her to her knees the first time, she can\u2019t get there fast enough\u2014and then hates herself afterward for losing control. A", "publisher": "Ellora's Cave Publishing Inc", "ISBN_13": "9781419994289", "keyWords": ["Meredith", "Mark"], "infoLink": "http://books.google.com/books?id=uKQ0CgAAQBAJ&dq=go&as_pt=BOOKS&hl=&source=gbs_api", "authors": ["L.E. Chamberlin"], "ISBN_10": "141999428X", "maturityRating": "NOT_MATURE", "title": "The Rewards of Letting Go"}}

Additionally, view count for each record is provided separately.

Sample view data format: ('NLngYyWFl_YC', 1292215)

Motivation

Most of the websites which we use (For example, Snapdeal) are not giving accurate results as user wants. Hence, we felt this would be an interesting and useful project to work upon.

Algorithm/Framework

Algorithm - Inverted index
Framework - Spark

External tools used

nltk for language processing
wordnet corpora for lemmatization

Deliverables

What we will definitely accomplish:

We would definitely complete the basic search engine part which makes use of indexing and query retrieval based on relevance using Map Reduce paradigm. This is completed.

What we are likely to accomplish:

We would like to improve the search accuracy after seeing results from step 1 as our next step. Used view ranking for this purpose and had achieved this. This is completed.

What we would ideally like to accomplish:

We would plan on adding suggestions and autocorrect features, if time permits, or keep those as future steps. This step is time consuming can be taken be taken as future work. This is not completed.

Work division

Midhun

Data crawling, Data cleaning, KeyWord generation

Bharath

Indexing, Querying

Project flow

Data extraction
Keywords generation
Indexing
Querying

Data Extraction

Google books api was used to get the book details. For making query in google books api, search terms are to be provided. Search terms we took was the category names. This category name was crawled from goodreads.com

InitialDataExtraction/category_list_backup.py has a list of category name which we got from goodreads.com. A total of 802 category names were retrieved.

Each query from google books api can yield up to 40 records. Typically, for a search term, total number of available records would in the range of 1000 - 3000.

Challenges in data extraction phase

Duplicate entry detection

This was resolved by keeping a record of already downloaded data. This was important because the program had to be executed over a span of days with breaks due to network blocking problem by Google.
Blocking of our IP by Google

Easiest solution was to wait for a certain period of time before re-requesting and automating this task.

Keywords generation

KeyWord sub project was to retrieve keywords from description of the books.

Tried two approaches

Stanford Named Entity Recognition(NER)
Proper Noun extraction

Insights from two methods

Keywords from NER's found to be unreliable and Proper Nouns gave better results. Issues with NER's were, many unwanted words were generated and Improper entity definition for the terms and hence unreliable. However, Proper Nouns were not giving False Positives as in the former case

Json data manipulation was performed using gson library

Input to this method was the data retrieved from google api.

Output is an added entity called keyWords in book record.

Indexing

This method generates an inverted index for each book record and are stored in index_rdd map.

Input for this method is referenced by id_doc_rdd_raw and its data format is as mentioned above.

index_rdd map has indexed term mapped to corresponding document id along with the zone/entity name

Following is an example.

('aceline', (('K3NLoAEACAAJ', 'keyWords'), ('7DiNBwAAQBAJ', 'title')))

'aceline' is the indexed term, K3NLoAEACAAJ is the doc_id, keyWords was the zone and likewise.

Brief walk through about Map Reduce in this phase

Map Phase

JSON data is converted to python data object
Keys are entities/zones, and value is their corresponding value in string of list format
All string values are converted to list
Lemmatized term for certain entities are taken for some of the entities as defined in the IndexConstants.LEMMA_ENTITIES constant.
Using flatMap, term is mapped to a tuple of document id and entity/zone name

Reduce Phase

This is a reduce by key phase
Addition of tuples of values(tuple of document id and entity/zone name) from map phase appended together for a key(term)

Querying

Each entity needs to have different weights. If 2 documents have the search term present in title and keyWords respectively, we have to give more priority to the document whose title matched. Likewise, more visited document has to be given higher priority.

Input for this method is the output from Index method and views_rdd.

Following is a weight map which was used in our model.

MODEL_WEIGHTS = {ISBN_10: .3, ISBN_13: .3, AUTHORS: .18, PUBLISHER: .15, TITLE: .15, CATEGORIES: .12, KEYWORDS: .1, VIEWS: .5}

Above weights are formulated after lot of query search experiments and intuitions.

View counts are generated for only select few documents.

Output of this phase is a list of tuple of document_id and score in sorted order.

Following table shows the accuracy improvement in the search while adding view count as a parameter.

Search term for this example was "thomas cormen"

Document Title	With View Rank	Without View Rank
Introduction to Algorithms	1	8
Introduction to Algorithms	2	18
Introduction to Algorithms	3	12
Introduction to Algorithms	4	14
Introduction to Algorithms	5	2
Introduction to Algorithms	6	5
United States Computer Specialist Introduction	7	23
Algorithms Unlocked	8	6
Faculty by University Or College in New Hampshire Faculty by University Or College in New Hampshire: Dartmouth College Faculty, University of New Hamp	9	24
Introduzione Agli Algoritmi	10	3
Introduction to Algorithms-Instructor?s Manual	11	11
Concentrator Switches for Routing Messages in Parallel Computers	12	7
Introduction to Algorithms and Java CD-ROM	13	16
算法导论	14	19
Computer Science Teachers	15	29
Nine Algorithms That Changed the Future	16	32
Coding Puzzles, 2nd Edition	17	25
Introduction to Algorithms and Java CD-ROM	18	9
Introduction to Algorithms	19	10
Vector Layout in Virtual-memory Systems for Data-parallel Computing	20	20
Introduction to Algorithms	21	1
Introduction to Algorithms and Java CD-ROM	22	15
Introduction to Algorithms (Instructor's Manual)	23	13
Introduction to Algorithms	24	21
Studyguide for Introduction to Algorithms by Thomas H. Cormen, Isbn 9780262033848	25	17
Introduction To Algorithms	26	4
e-Study Guide for Introduction to Algorithms, textbook by Thomas H. Cormen	27	22
Coding Puzzles	28	27
North American Scientist Introduction	29	30
North American Scientist Introduction	30	31
American Scientist Introduction	31	26
American Computer Specialist Introduction	32	28

Similar titles are there in the table, since, a book could have multiple editions, and book types(hardcover, paperback, e-book).

Brief walk through about filter phase after Querying

Each query term is separated by whitespace
Query term with hiphen are taken as a special case. If the splits(at hiphen, eg: 930-123) for this case are numbers, hiphen is removed and combined together(930-123 becomes 930123). If the splits are alphanumeric(abc-xyz), each split is taken separately(abc-xyz becomes 'abc' and 'xyz')
Lemmatization for each term is done as well in search list
AND operation is performed between all query terms and the results obtained after AND operation is searched through inverted indices.
View count for the documents obtained from inverted index data is also considered.
A final ranking is done for these documents based on the weights for each query term and view count.
Sorting of above result in descending order gives the final serach outputs.

Results

Mini search engine was implemented successfully.

Note that we had a total of 120,000 books in our sample data-set and hence a direct comparison between an established search engine like google wont give exact accuracy rate. However, for the books which have, following are findings.

Following table shows the comparison of our search engines rank with the books which came 1st in google

Term	Document Id	Rank in this system
clrs	i-bUBQAAQBAJ	1
deception point	BjFZ0UQOUCgC	1
bleak house	OeD4XMKlxUIC	1