Background
Does your work involve extensive online research and compiling the findings into reports? If so, you're likely familiar with the vast amount of information available on the internet, a crucial resource in today's digital age.
For my work in cyber threat intelligence and analysis, I often need to quickly analyze and compile information about specific threats, such as threat actor groups or malware. While commercial threat intelligence feeds are invaluable, open-source information from the public internet also plays a significant role in threat research and analysis. However, unlike threat intelligence platforms (TIPs), where data is well organized and easy to search, open-source threat information is scattered across formats such as public threat intelligence reports, blogs, and news articles. To conduct a thorough analysis, one must sift through a wide range of sources, which is time-consuming. Automating the reading and extraction process would significantly streamline this task.
Automating Search
The foundation of online research is effectively “searching the internet.” This process can be automated using APIs from search engines such as Google and Brave. Both platforms provide APIs for general and news searches. The advantage of programmatic search is that the results—including the title, description, and URL of a webpage—can be easily saved into formats like Excel or CSV, enabling efficient post-processing.
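As a rough illustration of that flow, here is a minimal sketch that queries Brave's web search endpoint and writes the title, description, and URL of each result to CSV. The endpoint and the `X-Subscription-Token` header follow Brave's public API documentation. The function names and the `YOUR_API_KEY` placeholder are my own assumptions for illustration, not code from the repositories mentioned in this article.

```python
import csv
import json
import urllib.parse
import urllib.request

# Brave web search endpoint; the API key travels in the X-Subscription-Token header.
BRAVE_ENDPOINT = "https://api.search.brave.com/res/v1/web/search"


def brave_search(query, api_key, count=10):
    """Call the Brave web search API and return the decoded JSON response."""
    url = BRAVE_ENDPOINT + "?" + urllib.parse.urlencode({"q": query, "count": count})
    request = urllib.request.Request(
        url,
        headers={"Accept": "application/json", "X-Subscription-Token": api_key},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_rows(payload):
    """Pull the title, description, and URL out of each web result."""
    results = payload.get("web", {}).get("results", [])
    return [
        {
            "title": item.get("title", ""),
            "description": item.get("description", ""),
            "url": item.get("url", ""),
        }
        for item in results
    ]


def write_csv(rows, handle):
    """Write the rows to an open text file handle as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=["title", "description", "url"])
    writer.writeheader()
    writer.writerows(rows)


# Example usage (needs a valid API key and network access):
# payload = brave_search("APT41", api_key="YOUR_API_KEY")
# with open("results.csv", "w", newline="", encoding="utf-8") as f:
#     write_csv(extract_rows(payload), f)
```

Keeping the parsing step (`extract_rows`) separate from the network call makes it easy to test against a saved response or to swap in a different search provider.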
Check out the following GitHub repositories for code examples on performing programmatic Google general search and news search:
Documentation for Brave’s search API is available here.
Automating Search Result Post-Processing
Generative artificial intelligence (GenAI) can be leveraged to process search results programmatically. There are numerous use cases for GenAI, including extracting threat information, summarizing web content, and understanding key messages, among others. Models like Google Gemini, OpenAI ChatGPT, and Meta Llama each have their pros and cons, and some offer free tiers for developers with specific rate limits. You can customize GenAI prompts to fit your specific use cases, such as:
Understanding key messages from online articles:
prompt = "What is the core idea about generative artificial intelligence for me to take away from the following and provide me the evidence which you use to conclude this: \""
Summarizing online articles:
prompt = "help to summarize the following passage and convert into html format for easy reading"
Extracting threat information from online articles:
prompt = "Tell me about \"APT41's aliases\", \"motivation of attack\", " \
         "\"malwares use\", \"tools used\", \"vulnerabilities exploited\", " \
         "\"targeted industry\", \"targeted region or country\", \"tactic, technique and procedures\", " \
         "\"Mitre Att&ck tactics and techniques\", \"indicators of compromise\", " \
         "\"key information to take away\" from the following write up and then convert them into html format for easy reading: \""
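One way to keep several such prompts manageable is to store them under task names and attach the article text at call time. The sketch below is only an illustration: the task names, the `build_prompt` helper, and the `max_chars` truncation (a crude guard against overrunning the model's context window) are assumptions of mine, not part of the original program.

```python
# Hypothetical task names mapped to two of the prompts shown above.
PROMPTS = {
    "key_message": (
        "What is the core idea about generative artificial intelligence for me "
        "to take away from the following and provide me the evidence which you "
        "use to conclude this: "
    ),
    "summary": (
        "help to summarize the following passage and convert into html format "
        "for easy reading"
    ),
}


def build_prompt(task, article_text, max_chars=8000):
    """Combine a task prompt with the quoted, truncated article text."""
    instruction = PROMPTS[task]
    # Trim surrounding whitespace, then cap the length (assumed limit).
    body = article_text.strip()[:max_chars]
    return f'{instruction}\n\n"{body}"'
```

Calling `build_prompt("summary", article_text)` then yields a single string that can be sent to whichever model is in use.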
Architecture
To automate the search and post-processing of its results, a Python program can be developed based on the architecture depicted in the diagram below.
Using a search engine API, the URL of a web article can be retrieved programmatically. The content of the webpage can then be fetched using LangChain's WebBaseLoader. Here’s a snippet of the code to achieve this:
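A minimal sketch of this step, assuming the `langchain-community` and `beautifulsoup4` packages are installed. The helper names `fetch_article_text` and `collapse_whitespace` are hypothetical, and the whitespace cleanup is my own addition rather than something WebBaseLoader requires.

```python
def collapse_whitespace(text):
    """Squeeze the runs of spaces and newlines that page markup leaves behind."""
    return " ".join(text.split())


def fetch_article_text(url):
    """Load a web page with LangChain's WebBaseLoader and return its text."""
    # Imported here so the helper above works without the package installed.
    from langchain_community.document_loaders import WebBaseLoader

    documents = WebBaseLoader(url).load()
    # Each Document carries the extracted text in its page_content attribute.
    return collapse_whitespace(" ".join(doc.page_content for doc in documents))


# Example usage (needs network access):
# text = fetch_article_text("https://example.com/some-article")
```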
The retrieved webpage content can then be passed to a GenAI model with a specific prompt for post-processing. Below is a snippet of code for running the Meta Llama 3:8b model locally on a machine via Ollama, a software tool for running large language models (LLMs). This approach avoids the need for a hosted, paid API, so there are no usage fees. However, processing speed depends on the machine's specifications, such as RAM and GPU.
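As a sketch only, the following sends a prompt and the article text to a locally running Ollama instance through its HTTP endpoint (`/api/generate` on port 11434 by default), using nothing but the standard library. It assumes Ollama is running and the `llama3:8b` model has already been pulled. The linked repository may use a different client, such as a LangChain wrapper, and the function names here are hypothetical.

```python
import json
import urllib.request

# Default address of a local Ollama server (an assumption about your setup).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, article_text, model="llama3:8b"):
    """Build the JSON body: the prompt followed by the quoted article text."""
    return {
        "model": model,
        "prompt": f'{prompt}\n\n"{article_text}"',
        "stream": False,  # ask for one complete JSON reply instead of chunks
    }


def run_llama(prompt, article_text, model="llama3:8b", timeout=600):
    """Send the request to the local Ollama server and return the model's text."""
    data = json.dumps(build_payload(prompt, article_text, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With streaming disabled, the generated text is in the "response" field.
        return json.load(response)["response"]


# Example usage (needs a running Ollama server):
# answer = run_llama("help to summarize the following passage", article_text)
```

The generous timeout reflects that generation on a machine without a GPU can take minutes for a long article.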
Proof-of-Concept Implementation
For a complete example of a Python program that automates the search on Brave using its API, retrieves webpage content via LangChain's WebBaseLoader, and performs post-processing using Meta's Llama 3:8b model via Ollama, please refer to the GitHub repository: https://github.com/cyberanalyst86/ollama
This repository contains the code to programmatically search, retrieve, and process web content, saving the output as HTML for easy reading. Below is a snapshot of an HTML file showing the output of the Python program that summarizes a web article.
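For readers who want a feel for the final step, here is a small sketch of wrapping model output in an HTML page and saving it to disk. It is not taken from the repository. The `render_html` and `save_html` names are hypothetical, and the sketch assumes the model was prompted to return HTML, so the body is inserted as-is while only the title is escaped.

```python
import html
from pathlib import Path


def render_html(title, body_html):
    """Wrap the model's HTML output in a minimal standalone page."""
    safe_title = html.escape(title)  # titles come from search results, so escape them
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        f'<head><meta charset="utf-8"><title>{safe_title}</title></head>\n'
        f"<body>\n<h1>{safe_title}</h1>\n{body_html}\n</body>\n</html>\n"
    )


def save_html(path, title, body_html):
    """Write the rendered page to disk as UTF-8."""
    Path(path).write_text(render_html(title, body_html), encoding="utf-8")


# Example usage:
# save_html("summary.html", "Article title", "<p>Model output goes here</p>")
```

Because the body is not sanitized, output from an untrusted model or page should be reviewed before opening it in a browser with scripts enabled.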
Limitations
The program has the following limitations:
Duplicate Information: Search engine results with different titles may contain similar information, leading to duplicates.
Content Accessibility: Not all webpages can be loaded for processing due to paywalls or content restrictions that require registration.
Model Accuracy: The accuracy of a GenAI model's output depends on the model and the data it was trained on, so the output may not always be correct or consistent between runs.
Processing Speed and Efficiency: The speed and efficiency of processing are contingent on the machine's specifications, such as RAM and GPU.
Conclusion
While this program is not perfect and does not eliminate the need for human effort to screen the output, it does significantly increase the productivity of online research by automating some of the manual processing tasks.