GitGuardian, the world leader in automated secrets detection, today launched the 2024 edition of its State of Secrets Sprawl report. The study, the most comprehensive research on secrets exposed on public GitHub, reveals that 12.8 million new secret occurrences were leaked publicly on GitHub in 2023, a 28% increase over 2022. Remarkably, the incidence of publicly exposed secrets has quadrupled since the company started reporting in 2021.
The growing number of code repositories on GitHub, with 50 million new repositories added in the past year (+22%), increases the risk of both accidental and deliberate exposure of sensitive information. In 2023 alone, over 1 million valid occurrences of Google API secrets, 250,000 Google Cloud secrets, and 140,000 AWS secrets were detected.
While the IT sector, which includes software vendors, is the most affected industry, with 65.9% of all detected leaks, other industries are also impacted. These include Education, Science & Tech, Retail, Manufacturing, and Finance & Insurance, which account for 20.1%, 7%, 1.5%, 1.2%, and 1% of leaks, respectively.
This highlights the need for increased vigilance and proactive measures to protect sensitive information across all industries as the risks associated with secret sprawl continue to grow.
The Security Gap of Non-Revoked Secrets: A Major Risk for Companies
The research sheds light on an important security gap: 90% of exposed valid secrets remain active for at least five days after discovery, even after the author is notified. API keys and authentication tokens for major service providers such as Cloudflare, AWS, OpenAI, and even GitHub are among the secrets that often go unrevoked.
“Developers erasing leaky commits or repositories instead of revoking are creating a major security risk for companies, which will remain vulnerable to threat actors mirroring public GitHub activity for as long as the credential remains valid. These zombie leaks are the worst,” said Eric Fourrier, CEO and Founder of GitGuardian.
To assess the prevalence of zombie leaks, the study selected a random sample of 5,000 erased commits that had exposed a secret. Of the repositories that hosted these commits, only 28.2% were still accessible at the time of the study. This indicates that the remaining repositories were likely deleted or made private in response to the leak, suggesting that the prevalence of zombie leaks may be underestimated.
Furthermore, the study hypothesizes that companies may use DMCA takedowns as a means to govern leaky repositories over which they do not have control. In support of this, the study found that in 2023, 12.4% of the 2,050 repositories taken down by GitHub exposed at least one secret, representing a 37.8% increase from 2020.
These findings are crucial for grasping the full scope of the secrets sprawl issue. While most security initiatives focus on detecting leaks, the bottleneck lies in improving the security posture. Simply alerting developers falls short; what's truly essential is providing them with the necessary guidance and support to rectify their mistakes effectively.
“The Toyota breach in 2022, which occurred after a hacker obtained credentials for one of its servers from source code published on GitHub, is proof that even five years after a leak, a compromise can still happen,” said Eric Fourrier.
Download the State of Secrets Sprawl 2024 report here.
A webinar presenting the report will be held on March 28th at 11 AM EDT.
Expanding beyond its previous editions, the report also explores the following topics:
Powering Secrets Detection with AI
Acknowledging that 2023 was a breakthrough year for generative AI, the report explores whether LLMs could serve as an alternative to traditional secrets detection tools, then takes an in-depth look at GitGuardian's AI-driven approach to enhancing the detection and management of secrets.
Unveiling Secret Exposures
Leveraging HasMySecretLeaked, the study also reveals that 3.11% of secrets leaked in private repositories were also exposed in public repositories. This dismantles the idea that relying on the privacy of source code as a security layer is a valid strategy.
Secrets Sprawl in PyPI
Secrets sprawl affects more than code repositories. This year, GitGuardian expanded its investigation into the pervasiveness of leaked secrets within PyPI (the official third-party package management system for the Python community). In 2023, 11,054 unique secrets were exposed in package releases. Approximately 10,000 of those secrets had been there since before 2023, and over 1,000 had been introduced that year.
Solving Secrets Sprawl
Lastly, the report provides a set of valuable recommendations for organizations committed to tackling secrets sprawl. A blend of awareness, training, and efficient, automated processes is essential. However, organizations must also employ discovery tools and robust controls. This is where secrets detection and remediation platforms come in, facilitating continuous security assessment of secrets, enforcing consistent policies throughout the software development lifecycle, and speeding up incident resolution.
About GitGuardian
GitGuardian is the security platform for the DevOps generation. Founded in 2017, it has become the leader in automated secrets detection and is now focused on providing a comprehensive software supply chain security platform.
GitGuardian helps security teams define and enforce secure coding practices consistently and globally at every step of the software development process. Centered on collaboration between security and development teams, GitGuardian also helps organizations enhance their security posture by decentralizing and accelerating the remediation of hardcoded secrets vulnerabilities and misconfigurations in infrastructure-as-code and open-source dependencies.
Widely adopted by developer communities, GitGuardian is the #1 security application on GitHub Marketplace and is used by over 300,000 developers and leading companies, including Snowflake, Orange, Iress, Mirantis, Maven Wave, Payfit, and Bouygues Telecom. To learn more about GitGuardian, visit https://www.gitguardian.com.
Methodology
Secret
A secret is any sensitive data we want to keep private. When discussing secrets in software development, we refer to digital authentication credentials that grant access to services, systems, and data. These are most commonly API keys, username and password combos, or private keys. In this report, secrets refer to credentials hard-coded in plaintext.
Secrets detection
GitGuardian continually scans every public GitHub commit in real time for leaks using its advanced secrets detection engine, operational since 2017. This engine is consistently refined by examining billions of commits to strike the optimal balance between precision (minimal false positives) and recall (minimal false negatives). Beyond simple regex patterns, the state-of-the-art secrets detection engine uses a sophisticated blend of pre- and post-validation steps to achieve optimal results. GitGuardian boasts an extensive library of specific detectors, capable of identifying over 350 types of secrets, along with a unique set of generic detectors designed to capture a wide array of secrets not specific to any particular service. An exhaustive list can be found here.
For deeper insights into the workings and performance benchmarking of our detection engine, visit our blog.
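As a rough illustration of the general idea of pairing specific and generic detectors with a validation step, here is a minimal Python sketch. It is not GitGuardian's engine. The AWS access key ID pattern follows the publicly documented key format, while the generic pattern, the placeholder list, the entropy threshold, and all function names are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

# Specific detector: publicly documented shape of an AWS access key ID
# ("AKIA" followed by 16 uppercase letters or digits).
AWS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")

# Generic detector (hypothetical): a quoted value of 12+ characters assigned
# to a variable whose name looks secret-related.
GENERIC = re.compile(
    r"""(?i)(?:api_?key|secret|token|password)\w*\s*[:=]\s*["']([^"']{12,})["']"""
)

# Assumed post-validation settings, chosen only for illustration.
PLACEHOLDERS = ("example", "changeme", "your_", "xxxx")
MIN_ENTROPY_BITS = 3.0


def shannon_entropy(value: str) -> float:
    """Return the Shannon entropy of the string, in bits per character."""
    counts = Counter(value)
    total = len(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_candidates(text: str) -> list[tuple[str, str]]:
    """Return (detector_name, value) pairs that survive post-validation."""
    found = [("aws_access_key_id", m.group(1)) for m in AWS_KEY_ID.finditer(text)]
    for match in GENERIC.finditer(text):
        value = match.group(1)
        # Post-validation 1: drop obvious placeholders.
        if any(p in value.lower() for p in PLACEHOLDERS):
            continue
        # Post-validation 2: drop low-entropy values such as repeated characters.
        if shannon_entropy(value) < MIN_ENTROPY_BITS:
            continue
        found.append(("generic", value))
    return found
```

A production engine would go much further, for example by checking candidates against the provider to confirm they are valid. The sketch only shows why validation steps after pattern matching reduce false positives.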
Study perimeter
To ensure that the data presented here most accurately represents the state of secrets sprawl, particularly to eliminate as many generic false positives as possible, filtering was applied to the data collected in 2023. Beyond the filtering process, we also manually excluded outliers (repositories exhibiting abnormally high leak rates, where a secret might be committed every minute) from this defined perimeter to ensure the integrity and accuracy of our metrics.
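The report describes this outlier exclusion as a manual step and does not publish a threshold. As a hedged sketch of how such repositories could be flagged for review, the Python example below computes a per-repository leak rate and separates anything above a cutoff. The `RepoStats` structure, its field names, and the 100-leaks-per-day default are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoStats:
    name: str
    secrets_found: int  # secret occurrences detected during the study year
    active_days: int    # days with at least one commit (assumed metric)


def leaks_per_day(repo: RepoStats) -> float:
    """Average number of detected secrets per active day."""
    return repo.secrets_found / max(repo.active_days, 1)


def split_outliers(repos, max_leaks_per_day=100.0):
    """Split repos into (kept, flagged) using an assumed rate threshold."""
    kept, flagged = [], []
    for repo in repos:
        if leaks_per_day(repo) > max_leaks_per_day:
            flagged.append(repo)  # candidate for manual exclusion
        else:
            kept.append(repo)
    return kept, flagged
```

For scale, a repository committing one secret every minute would reach 1,440 leaks per day, far above the illustrative cutoff used here.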