Tech Man

Increase Productivity of Online Research with Generative AI

Updated: Dec 28, 2024




Background


Does your work involve extensive online research that you then compile into reports? If so, you're likely familiar with the vast amount of information available on the internet, a crucial resource in today's digital age.

 

For my work in cyber threat intelligence and analysis, I often need to quickly analyze and compile information about specific threats, such as threat actor groups or malware. While commercial threat intelligence feeds are invaluable, open-source information from the public internet also plays a significant role in threat research and analysis. However, unlike threat intelligence platforms (TIPs) where data is well-organized and easy to search, open-source threat information can be scattered across various formats like public threat intelligence reports, blogs, news articles, and more. To conduct a thorough analysis, one must sift through a wide range of information, which can be time-consuming. Automating the reading and extraction process would significantly streamline this task.


Automating Search

 

The foundation of online research is effectively “searching the internet.” This process can be automated using APIs from search engines such as Google and Brave. Both platforms provide APIs for general and news searches. The advantage of programmatic search is that the results—including the title, description, and URL of a webpage—can be easily saved into formats like Excel or CSV, enabling efficient post-processing.


Check out the following GitHub repositories for code examples on performing programmatic Google general search and news search:

Documentation for Brave’s search API is available here.
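As a rough illustration of programmatic search, the sketch below queries Brave's web search endpoint and saves the title, description, and URL of each result to a CSV file. The endpoint URL, the X-Subscription-Token header, and the web.results response shape are assumptions based on Brave's public API documentation and should be checked against it; the function names are illustrative only.

```python
import csv
import json
import urllib.parse
import urllib.request

# Assumed endpoint; confirm against Brave's API documentation.
BRAVE_ENDPOINT = "https://api.search.brave.com/res/v1/web/search"


def brave_search(query, api_key, count=10):
    """Call the Brave web search endpoint and return the parsed JSON body."""
    params = urllib.parse.urlencode({"q": query, "count": count})
    request = urllib.request.Request(
        BRAVE_ENDPOINT + "?" + params,
        headers={"Accept": "application/json", "X-Subscription-Token": api_key},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_rows(payload):
    """Reduce a search response to title/description/url rows."""
    results = payload.get("web", {}).get("results", [])
    return [
        {
            "title": item.get("title", ""),
            "description": item.get("description", ""),
            "url": item.get("url", ""),
        }
        for item in results
    ]


def save_rows(rows, path):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["title", "description", "url"])
        writer.writeheader()
        writer.writerows(rows)
```

A typical call would be save_rows(extract_rows(brave_search("APT41", api_key)), "results.csv"), with api_key being your own Brave subscription token.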


Automating Search Result Post-Processing


Generative artificial intelligence (GenAI) can be leveraged to process search results programmatically. There are numerous use cases for GenAI, including extracting threat information, summarizing web content, and understanding key messages, among others. Models like Google Gemini, OpenAI ChatGPT, and Meta Llama each have their pros and cons, and some offer free tiers for developers with specific rate limits. You can customize GenAI prompts to fit your specific use cases, such as:

  • Understanding key messages from online articles

    • prompt = "What is the core idea about generative artificial intelligence for me to take away from the following and provide me the evidence which you use to conclude this: \""

  • Summarizing online articles:

    • prompt = "help to summarize the following passage and convert into html format for easy reading"

  • Extracting threat information from online articles:

    • prompt = "Tell me about \"APT41's aliases\", \"motivation of attack\", \"malware used\", \"tools used\", \"vulnerabilities exploited\", \"targeted industry\", \"targeted region or country\", \"tactics, techniques and procedures\", \"MITRE ATT&CK tactics and techniques\", \"indicators of compromise\", \"key information to take away\" from the following write up and then convert them into html format for easy reading: \""
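Prompts like these can be assembled in code so that the threat actor and the list of fields are easy to change. The following is a minimal sketch, not the program's actual prompt builder: the function names, the THREAT_FIELDS list, and the max_chars truncation limit are all hypothetical choices made for illustration.

```python
# Fields to extract; adjust to match your own reporting template.
THREAT_FIELDS = [
    "aliases",
    "motivation of attack",
    "malware used",
    "tools used",
    "vulnerabilities exploited",
    "targeted industry",
    "targeted region or country",
    "tactics, techniques and procedures",
    "MITRE ATT&CK tactics and techniques",
    "indicators of compromise",
    "key information to take away",
]


def threat_instruction(actor):
    """Build the extraction instruction for one threat actor."""
    quoted = ", ".join(f'"{field}"' for field in THREAT_FIELDS)
    return (
        f"Tell me about {actor}'s {quoted} from the following write-up, "
        "then convert them into HTML format for easy reading:"
    )


def build_prompt(instruction, article_text, max_chars=8000):
    """Append the article text, truncated to max_chars, to an instruction."""
    excerpt = article_text.strip()[:max_chars]
    return f'{instruction}\n\n"{excerpt}"'
```

Truncating the article is a simple way to stay within a model's context window; the 8000-character default is arbitrary and should be tuned to the model in use.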


Architecture


To automate the search and post-processing of its results, a Python program can be developed based on the architecture depicted in the diagram below.


Img1. Architecture of Python program for automating search and post-processing

Using a search engine API, the URL of a web article can be programmatically retrieved. The content of the webpage can then be fetched using LangChain's WebBaseLoader. Here’s a snippet of the code to achieve this:


Img2. Code snippet for fetching webpage content
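In text form, the fetching step might look like the sketch below. It is not necessarily the code from the repository: it assumes the langchain-community and beautifulsoup4 packages are installed, and the clean_text helper is a hypothetical addition that tidies the whitespace left over from page layout.

```python
import re


def clean_text(raw):
    """Collapse runs of spaces and tabs and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


def fetch_page_text(url):
    """Load a webpage with LangChain's WebBaseLoader and return its text."""
    # Imported lazily so clean_text can be used without LangChain installed.
    from langchain_community.document_loaders import WebBaseLoader

    documents = WebBaseLoader(url).load()
    return clean_text("\n".join(doc.page_content for doc in documents))
```

The returned text can then be combined with a prompt and passed to the model.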

The retrieved webpage content can then be passed into a GenAI model with a specific prompt for post-processing. Below is a snippet of code for running the Meta Llama 3 8B model (tagged llama3:8b in Ollama) locally via Ollama, a software tool for running large language models (LLMs) on your own machine. This approach removes the need for a paid hosted API, so the only cost is your own hardware. However, processing speed depends on the machine's specifications, such as RAM and GPU.


Img2. Code snippet for running Meta Llama 3:8b via Ollama locally
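A text version of this step could look like the sketch below, which calls a locally running Ollama server over its REST interface using only the standard library. This is an alternative to whatever client the repository uses, and it assumes Ollama is listening on its default port 11434 and that the llama3:8b model has already been pulled; the function names are illustrative.

```python
import json
import urllib.request

# Default local Ollama endpoint; change it if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="llama3:8b"):
    """Build a non-streaming generate request body."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt, model="llama3:8b", url=OLLAMA_URL):
    """Send the prompt to the local model and return its text reply."""
    body = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    # Local inference can be slow on modest hardware, hence the long timeout.
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.load(response)["response"]
```

Setting stream to False makes the server return one JSON object containing the full reply, which keeps the client code short at the cost of waiting for the whole generation to finish.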

Proof-of-Concept Implementation


For a complete example of a Python program that automates the search on Brave using its API, retrieves webpage content via LangChain's WebBaseLoader, and performs post-processing using Meta's Llama 3 8B model via Ollama, please refer to the GitHub repository: https://github.com/cyberanalyst86/ollama


This repository contains the code to programmatically search, retrieve, and process web content, saving the output as HTML for easy reading. Below is a snapshot of an HTML file showing the output of the Python program that summarizes a web article.


Img3. Sample output of summarized web article in HTML
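Saving the model output as an HTML file can be sketched as follows. This is an illustrative layout rather than the one used in the repository: the function names and page structure are hypothetical, and the model output is assumed to already be an HTML fragment because the prompt asks for HTML.

```python
import html
from pathlib import Path


def render_report(title, source_url, body_html):
    """Wrap a model-generated HTML fragment in a minimal page."""
    safe_title = html.escape(title)
    safe_url = html.escape(source_url, quote=True)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{safe_title}</title></head>\n'
        f"<body><h1>{safe_title}</h1>\n"
        f'<p>Source: <a href="{safe_url}">{safe_url}</a></p>\n'
        # body_html is inserted as-is, so review it before sharing the file.
        f"{body_html}\n"
        "</body></html>\n"
    )


def save_report(path, page):
    """Write the rendered page to disk as UTF-8."""
    Path(path).write_text(page, encoding="utf-8")
```

Only the title and URL are escaped here; the model's own HTML is trusted, which is acceptable for local reading but worth revisiting if the reports are published.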

 

Limitations


Nothing is perfect, and the following are the limitations of the program:

  1. Duplicate Information: Search engine results with different titles may contain similar information, leading to duplicates.

  2. Content Accessibility: Not all webpages can be loaded for processing due to paywalls or content restrictions that require registration.

  3. Model Accuracy: The accuracy of the output from a GenAI model depends on the model and the data it was trained on, and the same prompt can produce different or incorrect output from one run to the next.

  4. Processing Speed and Efficiency: The speed and efficiency of processing are contingent on the machine's specifications, such as RAM and GPU.
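The first limitation can be partly mitigated before any content reaches the model. The sketch below is one possible approach, not part of the original program: it drops results whose normalized URL has already been seen or whose description closely matches an earlier one. The 0.9 similarity threshold and the function names are hypothetical, and rows are assumed to be dictionaries with url and description keys.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to host plus path so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def is_near_duplicate(first, second, threshold=0.9):
    """Return True when two non-empty descriptions are nearly identical."""
    if not first or not second:
        return False
    return SequenceMatcher(None, first.lower(), second.lower()).ratio() >= threshold


def deduplicate(rows, threshold=0.9):
    """Keep the first occurrence of each URL and each distinct description."""
    kept = []
    seen_urls = set()
    for row in rows:
        key = normalize_url(row["url"])
        if key in seen_urls:
            continue
        if any(
            is_near_duplicate(row["description"], other["description"], threshold)
            for other in kept
        ):
            continue
        seen_urls.add(key)
        kept.append(row)
    return kept
```

This compares only the short search result descriptions, so articles that repeat the same facts in different words will still get through and need manual screening.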


Conclusion


While this program is not perfect and does not eliminate the need for human effort to screen the output, it does significantly increase the productivity of online research by automating some of the manual processing tasks.

