Automated keyword market research
The objective of this bachelor thesis was to develop a prototype that can find relevant information based on keywords and provide a summary of the web pages in the results. This enables the user to quickly get information on certain keywords.
Ulrich Pogson, 2022
Bachelor Thesis, Institut für Wirtschaftsinformatik, Hochschule für Wirtschaft FHNW
Supervising lecturer: Elzbieta Pustulka
Keywords: Text extraction & summarization for research
The Competence Center Systems Engineering is interested in automating the research process of finding possible solutions to business problems. Many areas of the economy are characterized by a high degree of innovation and a multitude of new technical solutions, which makes it difficult for potential users to maintain a good overview. Suitable tools could make it easier to search for new IT-related solutions in the health care environment and to keep an overview of them.
Based on the requirements, preliminary research was done on how such a prototype could be developed. This research identified five fields: User Interface, Hosting, Search Engine, Web Scraping, and Automatic Summarization. A suitable library or solution was found for each field. For the automatic summarization, a pretrained transformer model for summarization from Hugging Face was used. Once the prototype was developed, the results were evaluated based on five keywords that were provided at the start of the thesis.
The client was provided with a working prototype written in Python that does not depend on any SaaS solutions. The prototype can be hosted on a suitable host and accessed through the browser. The simple and well-documented code can serve as a basis for further optimizations.
The evaluation of the summaries and results for the five keywords revealed a few challenges: scraping data from the web pages, and the maximum length of input content that the transformer models can process. Two solutions to the input-length challenge were applied. The first was to reduce the input length by filtering the content with the keywords. The second was to batch process the content by creating blocks that were smaller than the maximum length. The solutions worked well in the evaluation, which also identified areas that could be further optimized.
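The two input-length solutions described above can be illustrated with a minimal sketch. It is not the thesis code: the function names `split_sentences`, `filter_by_keywords`, and `make_blocks` are hypothetical, the sentence splitter is deliberately naive, and length is approximated by whitespace-separated words, whereas a real transformer model limits the number of tokens produced by its own tokenizer.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def filter_by_keywords(text, keywords):
    """Solution 1 (assumed form): keep only sentences mentioning a keyword."""
    lowered = [k.lower() for k in keywords]
    return [
        s for s in split_sentences(text)
        if any(k in s.lower() for k in lowered)
    ]


def make_blocks(sentences, max_words):
    """Solution 2 (assumed form): greedily pack sentences into blocks.

    Each block holds at most max_words words, except that a single
    sentence longer than max_words is emitted as its own oversized block.
    """
    blocks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new block when adding this sentence would exceed the limit.
        if current and count + n > max_words:
            blocks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        blocks.append(" ".join(current))
    return blocks
```

Each resulting block could then be passed to the summarization model separately and the partial summaries joined. How the prototype combines the partial results is not stated in this abstract.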
Degree program: Business Information Technology (Bachelor)
Subject area of the thesis: