MSc Projects

Potential MSc Project Topics


1. Document fingerprints are widely used to compare a document against a database of other documents in order to detect plagiarism. A variety of fingerprinting algorithms have been proposed, each with free parameters that affect performance. This project would investigate the effect of these parameters on performance and compare the performance of the various algorithms.

2. Document fingerprints can be used as keys to access a database that provides complete citation information for the document. This database can be constructed in a variety of ways. This project will consider a data mining approach using three primary sources. The first is 2Tbytes of data related to  The second is a combination of citation data from DBLP together with PDFs retrieved via Google Scholar.

3. Shingles are a form of document fingerprint used to detect near-duplicate web pages. This project would investigate how to adapt shingles to detect near-duplicate images.
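
As background for topic 3, the text-based form of shingling can be sketched as follows. This is a minimal illustration only, assuming word-level shingles of length k and Jaccard similarity as the resemblance measure; the function names and the default k=3 are hypothetical choices made for this example, and nothing here prescribes how shingles should be adapted to images.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word k-grams) in text.

    Assumption: words are obtained by lowercasing and splitting on whitespace.
    A text shorter than k words yields a single shingle holding all its words.
    """
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; taken to be 1.0 if both sets are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def resemblance(doc1, doc2, k=3):
    """Estimate how close two documents are by comparing their shingle sets."""
    return jaccard(shingles(doc1, k), shingles(doc2, k))
```

Two pages would be flagged as near-duplicates when their resemblance exceeds a chosen threshold. Both the threshold and the shingle length k are free parameters of the kind mentioned in topic 1.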

Contact Prof. Ingemar Cox for more information.

Information Retrieval and Data Mining

1. Content Recommendation and Collaborative Filtering

2. Mobile location-aware search. Suppose you are in London and want to find a restaurant nearby. Can your smartphone suggest one based on your location and guide you there? Can it also show you recommendations from other people who have already visited the restaurant?

3. Web Sentence Search. Have you ever had difficulty writing an essay in English?
If so, the Internet might help. The many web pages of online newspaper sites provide authoritative examples of written English. The problem is how to find them: more specifically, how to retrieve example sentences or paragraphs that use a given term in the way you are looking for.

4. Our industry collaborator Polecat offers exciting practical thesis projects for students to choose from. The company’s MeaningMine business intelligence software collects and analyzes all forms of unstructured online and social media content. This online conversation data (news, blogs, tweets, etc.) is analyzed to deliver strategic insights and trends to Fortune 500 companies and government entities worldwide, such as competitive intelligence, testing the reception of key marketing messages, and assessing customer and market sentiment.

Potential projects include:

1) to use machine learning and statistical methods to identify relevant statistical correlations among the data stored in the data warehouse. These correlations will enable the development of predictive models for assessing, among other things, the incremental impact of healthy and unhealthy articles on overall sentiment, whether the release of articles on a particular date affects sentiment in a specific geography, and whether blogs published in one country affect sentiment more than those published in others.

2) to research how language changes over time for a given discussion. Given a subject discussed in the media and blogosphere, can the evolution of the conversation be tracked and reported on? For example, coverage of the recent Egyptian protests began as a locally reported issue concerning Egypt, but the protests quickly came to be seen as a touchpaper for wider unrest in the Middle East. It is these types of shifts in topic and vocabulary that the project aims to discover and report on.

3) to identify up to three top-performing information filtering algorithms based on six months of social media postings and online documents pertaining to clean technology. The data warehouse ingests and collates millions of documents based on a wide variety of parameters, such as date, author, publisher, platform, geography and sentiment.
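
As a starting point for project 2), one very simple way to surface a vocabulary shift is to compare relative term frequencies between an earlier and a later batch of documents. The sketch below is only an illustration under stated assumptions: whitespace tokenization, two fixed time windows, and the hypothetical function names term_frequencies and rising_terms. It is not a description of how MeaningMine works, and a real project would likely need stop-word handling and significance testing.

```python
from collections import Counter


def term_frequencies(docs):
    """Map each term to its share of all tokens across the given documents.

    Assumption: tokens are produced by lowercasing and splitting on whitespace.
    """
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


def rising_terms(early_docs, late_docs, top_n=5):
    """Return the top_n terms whose relative frequency grew most between windows.

    Ties are broken alphabetically so that the result is deterministic.
    """
    early = term_frequencies(early_docs)
    late = term_frequencies(late_docs)
    vocabulary = set(early) | set(late)
    change = {t: late.get(t, 0.0) - early.get(t, 0.0) for t in vocabulary}
    return sorted(change, key=lambda t: (-change[t], t))[:top_n]
```

Calling rising_terms on documents from successive weeks would give a rough picture of which terms are entering the conversation. Reversing the two arguments lists the terms that are fading.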

Contact Dr. Jun Wang for more information.