dedupe

command module
v0.0.0-...-ba01e04
Published: Aug 9, 2025 License: MIT Imports: 13 Imported by: 0

README

Dedupe

Dedupe is a fast, memory-efficient Go tool that removes duplicate URLs from a list. It can filter by query string and file extension, and it can normalize path segments before comparing URLs. It is useful for cleaning up large URL lists for web scraping, penetration testing, or bug bounty work.


Features

  • Fast and Efficient: Built in Go for high performance and low memory usage.
  • Deduplication: Removes duplicate URLs based on hostname, path (normalized), and query parameter keys.
  • Filtering:
    • Exclude URLs with specific extensions (-fe css,png,js) or include only certain extensions (-me php,js).
    • Include only URLs with query strings (-qs).
    • Optional regex-based normalization (-r) treats integers and GUIDs in paths as placeholders, effectively deduplicating similar URLs like /api/user/1 vs /api/user/2.
    • Optional language/country code normalization in paths (-lc).
  • Combine flags freely: Toggle any combination without a separate --mode.
  • Input/Output:
    • Accepts URLs from a file (-i) or stdin.
    • Writes to stdout or to a file via -o.
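
The deduplication rule above (hostname, normalized path, and query parameter keys) can be sketched as a key function. This is a minimal illustration, not the tool's actual code: `dedupeKey` and `uniqueByKey` are hypothetical names, and lowercasing the host and trimming a trailing slash are assumed normalization steps.

```go
package main

import (
	"fmt"
	"net/url"
	"sort"
	"strings"
)

// dedupeKey builds a comparison key from the hostname, the path, and the
// sorted query parameter names. Parameter values are ignored on purpose.
// Hypothetical helper: the real tool may normalize differently.
func dedupeKey(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	query := u.Query()
	keys := make([]string, 0, len(query))
	for k := range query {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	// Assumed normalization: lowercase host, no trailing slash.
	path := strings.TrimSuffix(u.Path, "/")
	host := strings.ToLower(u.Hostname())
	return host + "|" + path + "|" + strings.Join(keys, "&"), nil
}

// uniqueByKey keeps the first URL seen for each key, preserving input order.
func uniqueByKey(urls []string) []string {
	seen := make(map[string]struct{})
	var out []string
	for _, raw := range urls {
		key, err := dedupeKey(raw)
		if err != nil {
			continue // skip lines that do not parse as URLs
		}
		if _, dup := seen[key]; dup {
			continue
		}
		seen[key] = struct{}{}
		out = append(out, raw)
	}
	return out
}

func main() {
	for _, u := range uniqueByKey([]string{
		"https://example.com/search?q=a&page=1",
		"https://example.com/search?page=2&q=b",
		"https://example.com/search?q=a",
	}) {
		fmt.Println(u)
	}
}
```

Under these assumptions the first two URLs share a key because they have the same parameter names, while the third has a different set of names and is kept.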

Installation

Prerequisites
  • Go 1.23 or higher.
Install from Source
git clone https://github.com/0xpugal/dedupe.git
cd dedupe
go build -o dedupe
Install via go install
go install github.com/0xpugal/dedupe@latest

Usage

Basic Usage
# stdin → stdout
cat urls.txt | dedupe -qs

# stdin → file (Linux/Mac)
cat urls.txt | dedupe -qs -o output.txt

# stdin → file (Windows PowerShell)
type urls.txt | .\dedupe.exe -qs -o output.txt

# file → file
dedupe -i urls.txt -o output.txt -qs

# exclude extensions
dedupe -i urls.txt -o output.txt -fe "js,css,png,jpg"

# include only these extensions
dedupe -i urls.txt -o output.txt -me "php,asp"

# normalize integers/GUIDs in paths (treat /1 and /2 as same)
dedupe -i urls.txt -o output.txt -r

# also normalize language/country codes in paths
dedupe -i urls.txt -o output.txt -r -lc
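
To illustrate what -r style normalization might look like, here is a rough sketch that replaces integer and GUID path segments with placeholders before comparison. The function name `normalizePath`, the placeholder strings, and the exact patterns are assumptions for illustration; the tool's own expressions may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// A segment made only of digits, e.g. "42".
	intSegment = regexp.MustCompile(`^[0-9]+$`)
	// A segment in the canonical 8-4-4-4-12 hexadecimal GUID layout.
	guidSegment = regexp.MustCompile(`(?i)^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`)
)

// normalizePath rewrites each path segment that is an integer or a GUID
// into a fixed placeholder, so /api/user/1 and /api/user/2 compare equal.
// Hypothetical helper with assumed placeholder names.
func normalizePath(path string) string {
	segments := strings.Split(path, "/")
	for i, s := range segments {
		switch {
		case intSegment.MatchString(s):
			segments[i] = "{int}"
		case guidSegment.MatchString(s):
			segments[i] = "{guid}"
		}
	}
	return strings.Join(segments, "/")
}

func main() {
	fmt.Println(normalizePath("/api/user/1"))
	fmt.Println(normalizePath("/api/user/2"))
	fmt.Println(normalizePath("/item/123e4567-e89b-12d3-a456-426614174000"))
}
```

Matching whole segments rather than substrings keeps paths such as /v2/users unchanged in this sketch.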
Help menu
Usage:
  cat input.txt | dedupe --output output.txt
  dedupe --input input.txt --output output.txt

Options:
  -i,  --input <file>            Input file (defaults to stdin)
  -o,  --output <file>           Output file (defaults to stdout)
  -qs, --query-string-only       Only include URLs that have query strings
  -fe, --filter-extensions list  Exclude URLs with these extensions (css,png,js,jpg)
  -me, --match-extensions list   Include only URLs with these extensions (php,aspx,jsp)
  -r,  --regex-parse             Use regex normalization (GUIDs, integers)
  -lc, --lang-country            Deduplicate by language/country codes
  -V,  --version                 Show version
  -U,  --update                  Check for updates and install latest
  -h,  --help                    Show this help

📋 Example

Input (urls.txt)
https://example.com/api/user/1
https://example.com/api/user/2
https://example.com/api/user/1?name=John
https://example.com/static/js/main.js
https://example.com/static/css/style.css
https://example.com/images/logo.png
https://example.com/index.php
https://example.com/login.asp?redirect=/dashboard
Command
dedupe -i urls.txt -fe "js,css,png" -qs
Output
https://example.com/api/user/1?name=John
https://example.com/login.asp?redirect=/dashboard
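
The filtering in this example can be approximated with a short sketch. It covers only the -fe and -qs behavior shown above and leaves out deduplication. `keep` and `filterURLs` are hypothetical names, and reading the extension with `path.Ext` is an assumption about how extensions are detected.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// keep reports whether a URL survives the two filters used in the example:
// an excluded-extension list and a "query strings only" switch.
// Hypothetical helper, not the tool's actual code.
func keep(raw string, excluded map[string]bool, queryOnly bool) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	if queryOnly && u.RawQuery == "" {
		return false
	}
	// path.Ext returns ".js" for "/static/js/main.js" and "" when the
	// last path element has no dot.
	ext := strings.ToLower(strings.TrimPrefix(path.Ext(u.Path), "."))
	return !excluded[ext]
}

// filterURLs applies keep to every URL, preserving input order.
func filterURLs(urls []string, excludedList string, queryOnly bool) []string {
	excluded := make(map[string]bool)
	for _, e := range strings.Split(excludedList, ",") {
		if e = strings.TrimSpace(e); e != "" {
			excluded[strings.ToLower(e)] = true
		}
	}
	var out []string
	for _, raw := range urls {
		if keep(raw, excluded, queryOnly) {
			out = append(out, raw)
		}
	}
	return out
}

func main() {
	input := []string{
		"https://example.com/api/user/1",
		"https://example.com/api/user/2",
		"https://example.com/api/user/1?name=John",
		"https://example.com/static/js/main.js",
		"https://example.com/static/css/style.css",
		"https://example.com/images/logo.png",
		"https://example.com/index.php",
		"https://example.com/login.asp?redirect=/dashboard",
	}
	for _, u := range filterURLs(input, "js,css,png", true) {
		fmt.Println(u)
	}
}
```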

Configuration file (optional)

Place a config.yml next to the binary or in the working directory to set defaults. CLI flags override config values.

Example config.yml:

query_string_only: true
filter_extensions: [js, css, png]
match_extensions: []
lang_country: true
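
The rule that CLI flags override config values can be sketched with the standard `flag` package. This is one possible approach, not necessarily how the tool does it. The `settings` struct and `resolve` function are hypothetical, YAML loading is omitted, and the defaults are assumed to have been read from config.yml already.

```go
package main

import (
	"flag"
	"fmt"
)

// settings mirrors two of the config.yml keys shown above.
// Hypothetical type for illustration.
type settings struct {
	QueryStringOnly bool
	LangCountry     bool
}

// resolve starts from config-file defaults and applies only the flags
// that were explicitly passed on the command line.
func resolve(defaults settings, args []string) (settings, error) {
	fs := flag.NewFlagSet("dedupe", flag.ContinueOnError)
	qs := fs.Bool("qs", false, "only include URLs that have query strings")
	lc := fs.Bool("lc", false, "normalize language/country codes in paths")
	if err := fs.Parse(args); err != nil {
		return defaults, err
	}
	out := defaults
	// Visit calls the function only for flags that were set, so flags
	// that were not passed leave the config values untouched.
	fs.Visit(func(f *flag.Flag) {
		switch f.Name {
		case "qs":
			out.QueryStringOnly = *qs
		case "lc":
			out.LangCountry = *lc
		}
	})
	return out, nil
}

func main() {
	fromConfig := settings{QueryStringOnly: true, LangCountry: true}
	final, err := resolve(fromConfig, []string{"-lc=false"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", final)
}
```

Checking which flags were set, rather than comparing against zero values, lets a flag turn off an option that the config file turned on.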

Contribute

Contributions are welcome! Please open an issue or submit a pull request.


📄 License

This project is licensed under the MIT License. See the LICENSE file for details.


Similar tools


Support

If you find this project useful, please give it a star on GitHub!
