Background
At my workplace, we are currently collaborating on a project to develop an Artificial Intelligence (AI) engine designed to extract threat data from uploaded threat reports. To train this AI model effectively, we need approximately 500 threat reports in PDF format.
Sources of Threat Reports
So, where do we get these threat reports? We rely on several sources:
Previously Read and Archived Threat Reports: Our own collection of previously reviewed and archived threat reports.
Open Source Database: vx-underground.org.
Google: For additional reports.
Vx-underground.org is a comprehensive website focused on malware and cybersecurity education. It offers a vast repository of malware samples, cybersecurity papers, and threat reports. This site provides yearly archived APT reports, conveniently zipped for bulk downloads, spanning from 2010 to 2023. Since the 2024 reports have yet to be archived and zipped, we developed a Python program to automate and expedite the downloading process.
Image 1: Snapshot of the Python program for downloading threat reports from vx-underground
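Since the program itself is only shown as a screenshot, here is a minimal sketch of what such a downloader could look like. It is not the original code: the function names (`filename_from_url`, `download_reports`), the `reports` output folder, and the assumption that you already have a list of direct PDF links are all ours, and the way those links are collected from vx-underground is deliberately left out.

```python
import os
import urllib.request
from urllib.parse import urlparse, unquote


def filename_from_url(url: str) -> str:
    """Derive a local file name from the last path segment of a URL."""
    name = os.path.basename(unquote(urlparse(url).path))
    return name or "report.pdf"


def download_reports(urls, out_dir="reports", timeout=30):
    """Download each URL into out_dir, skipping files that already exist."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for url in urls:
        target = os.path.join(out_dir, filename_from_url(url))
        if os.path.exists(target):
            continue  # already downloaded on a previous run
        request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                data = response.read()
        except OSError as exc:  # URLError and HTTPError are subclasses of OSError
            print(f"skipped {url}: {exc}")
            continue
        # Keep only files that start with the PDF magic bytes.
        if not data.startswith(b"%PDF"):
            print(f"skipped {url}: not a PDF")
            continue
        with open(target, "wb") as handle:
            handle.write(data)
        saved.append(target)
    return saved
```

Calling `download_reports(list_of_pdf_urls)` returns the paths of the files that were saved, so the count can be checked against the 500-report target.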
Acquiring the Latest Reports
Our requirement is to use only the latest threat reports. Therefore, we downloaded the 2023 and 2024 reports from vx-underground. However, these reports alone do not meet our 500-report target, necessitating the acquisition of additional reports from other sources.
Utilizing Google Dorks
To supplement our collection, we used Google dorks (search queries built from advanced operators) to find and download reports. To ensure the credibility of our sources, we identified several reputable organizations known for publishing threat reports:
CrowdStrike
Recorded Future
Mandiant
IBM
Microsoft
Cybersecurity and Infrastructure Security Agency (CISA)
Image 2: Sample Google dork for searching APT threat reports
Image 3: 2022 evaluation of digital threat intelligence management by Quadrant Knowledge Solutions
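The dork shown in the screenshot can also be generated in code, one query per publisher. The sketch below is an illustration under assumptions: the helper `build_dork`, the keyword list, the cut-off date, and the domain names are ours and should be checked against each organization's actual publishing site. It relies only on the standard `site:`, `filetype:`, and `after:` search operators.

```python
def build_dork(domain: str, keywords, filetype: str = "pdf", after: str = "") -> str:
    """Compose a Google search query restricted to one site and file type."""
    parts = [f"site:{domain}", f"filetype:{filetype}"]
    # Quote each keyword so multi-word phrases are matched exactly,
    # and OR them so any one keyword is enough.
    quoted = " OR ".join(f'"{k}"' for k in keywords)
    if quoted:
        parts.append(f"({quoted})")
    if after:
        parts.append(f"after:{after}")
    return " ".join(parts)


# Assumed domains for the organizations listed above.
DOMAINS = ["crowdstrike.com", "recordedfuture.com", "mandiant.com",
           "ibm.com", "microsoft.com", "cisa.gov"]

if __name__ == "__main__":
    for domain in DOMAINS:
        print(build_dork(domain, ["APT", "threat report"], after="2023-01-01"))
```

Keeping the query builder separate from the downloader makes it easy to add or drop a publisher without touching the rest of the pipeline.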
Automating the Process
To avoid the tedious task of manually searching and downloading these reports, we needed an automated solution. We considered two approaches: using the Python Selenium library and the Google API Python client.
Python Selenium: This approach allows us to scrape reports from the initial search results loaded in the browser. However, it is limited to scraping only what is immediately visible.
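Google API Python Client: This approach retrieves search results through an API instead of a rendered browser page, so it is not limited to what is visible on screen and can page through results programmatically, which makes the download process easier to automate.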
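To make the Selenium option concrete, here is a rough sketch of how links could be collected from the first results page. It is an assumption-laden illustration, not the code we ran: the function names are ours, it expects Chrome and a matching driver to be available, and Google may show a consent page or CAPTCHA to automated browsers, in which case no links are returned.

```python
from urllib.parse import urlparse


def filter_pdf_links(hrefs):
    """Keep unique links whose URL path ends in .pdf, preserving order."""
    seen, result = set(), []
    for href in hrefs:
        if not href:
            continue
        if urlparse(href).path.lower().endswith(".pdf") and href not in seen:
            seen.add(href)
            result.append(href)
    return result


def collect_pdf_links(query: str):
    """Open a Google results page in Chrome and return the PDF links on it."""
    # Imported here so the helper above works without Selenium installed.
    from urllib.parse import quote_plus
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get("https://www.google.com/search?q=" + quote_plus(query))
        anchors = driver.find_elements(By.TAG_NAME, "a")
        return filter_pdf_links(a.get_attribute("href") for a in anchors)
    finally:
        driver.quit()
```

Only the anchors present on the loaded page are read, which is exactly the limitation described for this approach.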
By leveraging these tools and methods, we aim to efficiently compile the necessary threat reports to train our AI model, ensuring both the quantity and quality of our dataset.
Image 4: Snapshot of the Python program for downloading threat reports via Google API Python Client
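For readers who cannot make out the screenshot, the following is a minimal sketch of the same idea using the `google-api-python-client` package and the Custom Search JSON API. It is not a copy of our program. The function names, the `total=30` default, and the `api_key` and `engine_id` values (a Google API key and a Programmable Search Engine ID that you create yourself) are assumptions. The API returns at most 10 results per request and documents a cap of 100 results per query, so the sketch pages through results in steps of 10.

```python
def page_starts(total: int, page_size: int = 10):
    """Return the 1-based start index of each result page."""
    return list(range(1, total + 1, page_size))


def extract_links(response: dict):
    """Pull result URLs out of one Custom Search JSON API response."""
    return [item["link"] for item in response.get("items", [])]


def search_pdf_reports(query: str, api_key: str, engine_id: str, total: int = 30):
    """Query the Custom Search JSON API page by page and collect result links."""
    # Imported here so the helpers above work without the client installed.
    from googleapiclient.discovery import build

    service = build("customsearch", "v1", developerKey=api_key)
    links = []
    for start in page_starts(total):
        response = (
            service.cse()
            .list(q=query, cx=engine_id, fileType="pdf", start=start, num=10)
            .execute()
        )
        page = extract_links(response)
        if not page:
            break  # no more results for this query
        links.extend(page)
    return links
```

The returned links can then be handed to whatever download routine is already in use for the vx-underground reports.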
Conclusion
This writeup illustrates how Python can be used to automate the collection of datasets from open sources to train an AI model. We hope readers find it useful for understanding both the process and the potential of automating data collection for AI training.