Elastic Open Crawler · SEO Control Center

Live Operations

Monitor crawler uptime, resources and quick system facts.

Service status

Crawler configuration snapshot

Default output —

User agent —

Threads per crawl —

Max depth —

Max pages —

Timeout —

Update defaults from the Run Crawl area. Settings persist across server restarts.

Quick commands

Use the SEO Audit Lite user agent for polite scans that look like Chrome.
Keep parallel threads between 2–4 to minimise blocking by target sites.
Export JSONL to re-ingest into Elastic, CSV for quick spreadsheet analysis.

Configure & Launch Crawls

Define connection preferences, crawl defaults and start new discovery jobs in seconds.

Elasticsearch connection (optional)

Host URL

Port

Default index

Username

Password

API key

SSL mode

Default output sink

Crawler image

Default user agent

Keep generated config files (useful for debugging)

Clear stored API key

Clear stored password

Crawl defaults (anti-blocking)

The defaults are very gentle: 2 crawl threads and a 1-second delay, ideal for avoiding blocks.

Run a new crawl

Target domain

Job name

User agent

Output sink

Output index (when using Elasticsearch)

Max depth

Threads per crawl

Max unique URLs

Request timeout (seconds)

Max duration (seconds)

Schedule pattern (cron)

Include URLs (optional)

One rule per line. Prefix with contains:, begins:, ends: or regex:. Leave empty to allow all paths.

Exclude URLs (optional)

Rules evaluated after includes. Uses the same prefix syntax.
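The include/exclude behaviour described above can be sketched roughly as follows. This is an illustration only, not the crawler's own code: the function names are hypothetical, and it assumes rules are matched against the URL path rather than the full URL.

```python
import re
from urllib.parse import urlsplit

PREFIXES = ("contains", "begins", "ends", "regex")

def parse_rules(text):
    """Turn one-rule-per-line text into (kind, pattern) pairs, skipping blank lines."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        kind, sep, pattern = line.partition(":")
        if not sep or kind not in PREFIXES:
            raise ValueError(f"unknown rule prefix in {line!r}")
        rules.append((kind, pattern.strip()))
    return rules

def rule_matches(rule, path):
    """Check a single (kind, pattern) rule against a URL path."""
    kind, pattern = rule
    if kind == "contains":
        return pattern in path
    if kind == "begins":
        return path.startswith(pattern)
    if kind == "ends":
        return path.endswith(pattern)
    return re.search(pattern, path) is not None  # kind == "regex"

def is_allowed(url, includes, excludes):
    """Includes are checked first (empty means allow all), then excludes."""
    # Assumption: rules apply to the path component only.
    path = urlsplit(url).path or "/"
    if includes and not any(rule_matches(r, path) for r in includes):
        return False
    return not any(rule_matches(r, path) for r in excludes)
```

For example, with includes `begins:/blog` and excludes `ends:.pdf`, `/blog/post` passes while `/blog/file.pdf` and `/about` are skipped.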

Custom extraction rules

Capture extra fields from each page. CSS & XPath selectors extract HTML text, regex operates on the URL.
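To make the split between selector rules and regex rules concrete, here is a minimal sketch using only the Python standard library. The rule format (`name -> (kind, expression)`) and the function name are assumptions for illustration, CSS selectors are left out because the standard library has no CSS engine, and `ElementTree` only parses well-formed markup, so real pages would need a proper HTML parser.

```python
import re
import xml.etree.ElementTree as ET

def extract_fields(url, html, rules):
    """Apply hypothetical extraction rules to one page.

    rules maps a field name to (kind, expression):
      - ("regex", pattern) is applied to the URL
      - ("xpath", expr) reads text from the parsed page
    """
    root = ET.fromstring(html)  # assumption: markup is well-formed
    out = {}
    for name, (kind, expr) in rules.items():
        if kind == "regex":
            match = re.search(expr, url)
            if match is None:
                out[name] = None
            else:
                # Prefer the first capture group when the pattern has one.
                out[name] = match.group(1) if match.groups() else match.group(0)
        elif kind == "xpath":
            node = root.find(expr)  # ElementTree supports a limited XPath subset
            out[name] = "".join(node.itertext()).strip() if node is not None else None
        else:
            raise ValueError(f"unsupported rule kind: {kind}")
    return out
```

A rule such as `("regex", r"/shop/([^/]+)/")` would pull the category segment out of the URL, while `("xpath", ".//h1")` would return the text of the first H1.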

Allow localhost crawling

Allow private/LAN IPs

The crawl will appear in the results within seconds and is saved to the local database.

Targeted Crawl (URL List)

Fetch a bespoke list of URLs without following additional links. Perfect for QA spot checks.

URL batch

Job name

URLs to crawl

Paste one URL per line. The crawler requests each URL exactly once and does not follow discovered links.

User agent

Output sink

Request timeout (seconds)

Max duration (seconds)

Custom extraction rules

Capture extra fields from each URL. CSS & XPath selectors extract from HTML, regex works on the requested URL.

Usage tips

Perfect for verifying redirects, metadata or status codes on a watch-list.
Supports hundreds of URLs; each domain is handled with a dedicated seed list.
Combine with custom extraction to pull schema, OpenGraph tags or structured snippets.
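The "one seed list per domain" idea can be pictured with a small sketch like the one below. It is only an approximation under stated assumptions: the function name is hypothetical, duplicates are dropped so each URL is requested once, and lines that are not http(s) URLs are silently skipped.

```python
from urllib.parse import urlsplit

def build_seed_lists(batch_text):
    """Group a pasted URL batch into per-domain seed lists, dropping duplicates."""
    seeds = {}
    seen = set()
    for line in batch_text.splitlines():
        url = line.strip()
        if not url or url in seen:
            continue
        parts = urlsplit(url)
        # Skip anything that is not an absolute http(s) URL.
        if parts.scheme not in ("http", "https") or not parts.hostname:
            continue
        seen.add(url)
        # hostname is lower-cased by urlsplit, so A.com and a.com share a list.
        seeds.setdefault(parts.hostname, []).append(url)
    return seeds
```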

Results Vault

Browse crawl history, inspect pages, export deliverables or clean up storage.

Job history

Job	Status	Documents	Started	Duration

Documents

Page Size

URL	Title	Status	Words	Links	Headings	Meta

Custom extraction output

URL	Title	Extraction

Document preview

Select a row to preview full JSON output.

Logs

Logs will appear here.

SQL Lab

Run read-only SQL queries against the local crawl archive. Results are limited to 500 rows.

Query editor

Quick SQL Lab documentation (for analysts)

What is in the database

Main table: documents

Main columns: doc_id, job_id, url, title, content_type, status_code, word_count, headings_count, links_count, meta_json, raw_json

The full content fields (page text, headings, links, etc.) are stored inside raw_json (JSON format). Meta descriptions are usually in meta_json.

Engine characteristics

Read-only: SELECT only. No PRAGMA, and no creating or altering tables.
Comparisons are case-sensitive by default. To ignore case, use COLLATE NOCASE or LOWER()/UPPER().
The JSON1 extension is available in most cases, so json_extract() works. If it is not, search the JSON string directly.

Common query patterns

Search the title only (case-insensitive):
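As a rough picture of what a read-only, row-capped runner can look like, the sketch below uses Python's `sqlite3` module. It is not the SQL Lab implementation: `run_readonly` and `MAX_ROWS` are hypothetical names, and the statement check is deliberately crude.

```python
import sqlite3

MAX_ROWS = 500  # mirrors the 500-row cap described above

def run_readonly(conn, sql):
    """Run a single SELECT and return at most MAX_ROWS rows."""
    statement = sql.strip().rstrip(";").strip()
    if not statement.lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    # Crude guard: also rejects semicolons inside string literals.
    if ";" in statement:
        raise ValueError("only a single statement is allowed")
    # SQLite refuses writes on this connection once query_only is on.
    conn.execute("PRAGMA query_only = ON")
    cursor = conn.execute(statement)
    return cursor.fetchmany(MAX_ROWS)
```

The cap is applied when fetching, so a query without LIMIT still returns no more than 500 rows.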

SELECT url, title
FROM documents
WHERE title LIKE '%keyword%' COLLATE NOCASE
LIMIT 500;

Search the full content (from JSON) plus the title:

SELECT url, title
FROM documents
WHERE title LIKE '%keyword%' COLLATE NOCASE
   OR json_extract(raw_json, '$.body') LIKE '%keyword%'
LIMIT 500;

If json_extract is not supported:

SELECT url, title
FROM documents
WHERE title LIKE '%keyword%' COLLATE NOCASE
   OR raw_json LIKE '%keyword%'
LIMIT 500;
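When it is unclear whether JSON1 is present, a script can probe for it and pick one of the two query shapes above. The following is a minimal sketch with hypothetical helper names; it assumes the `documents` table and the `$.body` path used in the queries above.

```python
import sqlite3

def has_json1(conn):
    """Return True if json_extract() is available on this connection."""
    try:
        conn.execute("SELECT json_extract('{\"a\": 1}', '$.a')").fetchone()
        return True
    except sqlite3.OperationalError:  # "no such function: json_extract"
        return False

def keyword_query(conn):
    """Build the title + body search, falling back to a raw string scan."""
    if has_json1(conn):
        body_clause = "json_extract(raw_json, '$.body') LIKE :pattern"
    else:
        body_clause = "raw_json LIKE :pattern"
    return (
        "SELECT url, title FROM documents "
        "WHERE title LIKE :pattern COLLATE NOCASE "
        f"OR {body_clause} LIMIT 500"
    )

def search(conn, keyword):
    """Run the keyword search with a bound parameter instead of string pasting."""
    return conn.execute(keyword_query(conn), {"pattern": f"%{keyword}%"}).fetchall()
```

Binding the keyword as a parameter avoids quoting problems when the search term contains an apostrophe.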

"Long" pages (by word count):

SELECT url, title, word_count
FROM documents
WHERE word_count >= 1200
ORDER BY word_count DESC
LIMIT 500;

Pages without a description (from meta_json):

SELECT url, title
FROM documents
WHERE json_extract(meta_json, '$.description') IS NULL
   OR json_extract(meta_json, '$.description') = ''
LIMIT 500;

Without JSON1:

SELECT url, title
FROM documents
WHERE LOWER(meta_json) NOT LIKE '%"description":"%'
   OR meta_json LIKE '%"description":null%'
   OR meta_json LIKE '%"description":""%'
LIMIT 500;
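String matching on meta_json depends on how the JSON was serialised (for example, a space after the colon breaks the pattern). If the archive is reachable from a script, parsing the JSON is more robust. The sketch below is an illustration under assumptions: the function name is hypothetical and the key is assumed to be `description`, as in the queries above.

```python
import json
import sqlite3

def pages_missing_description(conn, limit=500):
    """Return (url, title) pairs whose meta_json has no usable description."""
    missing = []
    query = "SELECT url, title, meta_json FROM documents"
    for url, title, meta_json in conn.execute(query):
        try:
            meta = json.loads(meta_json) if meta_json else {}
        except json.JSONDecodeError:
            meta = {}  # treat unparseable metadata as empty
        description = meta.get("description") if isinstance(meta, dict) else None
        # Missing, null, non-string and whitespace-only values all count as absent.
        if not isinstance(description, str) or not description.strip():
            missing.append((url, title))
            if len(missing) >= limit:
                break
    return missing
```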

Find the keyword "content" in the title only:

SELECT url, title
FROM documents
WHERE title LIKE '%content%' COLLATE NOCASE
LIMIT 500;

Count pages per job:

SELECT job_id, COUNT(*) AS pages
FROM documents
GROUP BY job_id
ORDER BY pages DESC
LIMIT 500;

Find pages with too many links or too few headings:

-- Too many links
SELECT url, title, links_count
FROM documents
WHERE links_count > 100
ORDER BY links_count DESC
LIMIT 500;

-- Too few headings
SELECT url, title, headings_count
FROM documents
WHERE headings_count <= 1
ORDER BY headings_count ASC
LIMIT 500;

Recommended working rules

Always add LIMIT (the system caps results at 500 rows, and it also protects you).
For case-insensitive searches, LIKE ... COLLATE NOCASE is preferable to unnecessary use of LOWER().
When searching content, prefer json_extract(raw_json, '$.body') if available; it is more precise and faster than scanning the entire JSON string.
If titles are in both Hebrew and English, combining with COLLATE NOCASE/UPPER() works well.

SQL for Noobs

Build smart filters on crawl results without writing a single SQL statement.

Build your query

Choose crawls

Filters

All conditions are combined with AND. Leave empty to skip filters.

Maximum rows

Results

User Agent Check

Check blocking by User-Agent to identify which bots get access and where they are blocked.

Check form

URLs to check

One URL per line. Maximum 10 URLs per run.

Delay between checks (ms)

Can be adjusted as needed for each check.

Follow redirects (up to 5)

When a redirect occurs, a yellow warning is shown.

User Agent list

— edit in the side panel

The check runs sequentially with a delay to avoid overloading the target server.
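The sequential run with a delay can be pictured as a simple nested loop. This is a sketch only: `run_checks` is a hypothetical name, `fetch` stands in for whatever performs the HTTP request and returns a status code, and the 10-URL cap mirrors the form limit above.

```python
import time

MAX_URLS = 10  # form limit: at most 10 URLs per run

def run_checks(urls, user_agents, fetch, delay_ms=500, sleep=time.sleep):
    """Check every URL against every user agent, one request at a time."""
    if len(urls) > MAX_URLS:
        raise ValueError(f"at most {MAX_URLS} URLs per run")
    results = []
    first = True
    for url in urls:
        for agent in user_agents:
            if not first:
                sleep(delay_ms / 1000)  # pause between consecutive checks
            first = False
            # fetch is a caller-supplied function: (url, agent) -> status code
            results.append({"url": url, "agent": agent, "status": fetch(url, agent)})
    return results
```

Passing `sleep` as a parameter keeps the loop easy to test without real waiting.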

User Agents under test

Agent	Device	User Agent

Changes are saved in the browser and applied to the next check.

Check from my computer (Local Runner)

Shell

PowerShell requires no additional installation. The output is saved automatically to the desktop as ua-results.txt.

Manual output import

You can paste JSON output and get a graphical view even without uploading to the server.

Check results

—

Suspected blocks / errors

This list collects blocks and errors to review with the server administrator.

How it works

Each URL is checked against every User-Agent in the list.
The check runs from the server (the VPS IP), so results may differ from those on your computer.
The system blocks checks of internal/private addresses (SSRF guard).
Redirects (3xx) are detected but not actually followed; they are only recorded.
Red = definite block (401/403/429/451), yellow = error/timeout.
You can copy the list of issues directly to the server administrator.
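Two of the rules above, the colour coding and the SSRF guard, can be sketched as small helper functions. Both are illustrations with hypothetical names. The treatment of redirects and of other 4xx/5xx codes as "yellow" is an assumption, and a real guard would also need to resolve hostnames to IP addresses before checking them.

```python
import ipaddress

BLOCK_CODES = {401, 403, 429, 451}  # "definite block" codes listed above

def classify(status, error=None):
    """Map a check result to 'red', 'yellow' or 'ok'."""
    if error is not None or status is None:
        return "yellow"  # timeout or network error
    if status in BLOCK_CODES:
        return "red"
    if status >= 300:
        return "yellow"  # assumption: redirects and other errors are warnings
    return "ok"

def is_internal_address(host):
    """Return True for hosts an SSRF guard would typically refuse."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; only obvious local names are caught here.
        # A real guard must resolve the name and check the resulting IPs.
        name = host.lower()
        return name == "localhost" or name.endswith(".localhost")
    return ip.is_private or ip.is_loopback or ip.is_link_local
```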

Toolkit & Playbooks

Ideas and workflows to extract maximum SEO value from each crawl.

Site Screener

Use the local archive output to locate orphan pages, broken templates and missing metadata. Filter by word_count, links_count or meta.description inside SQL Lab.

Internal link map

Export CSV and pivot by linksCount to see which pages are authoritative. Combine with headings count to surface thin content.
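A quick way to try this outside a spreadsheet is sketched below. The column names `url`, `linksCount` and `headingsCount` are assumptions about the CSV export, and the "thin" threshold of one heading or fewer is only an example.

```python
import csv
import io

def rank_by_links(csv_text, top=10):
    """Return the top pages by link count and the URLs with very few headings."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        # Assumed column names; empty cells are treated as zero.
        row["linksCount"] = int(row["linksCount"] or 0)
        row["headingsCount"] = int(row["headingsCount"] or 0)
    ranked = sorted(rows, key=lambda r: r["linksCount"], reverse=True)[:top]
    thin = [r["url"] for r in rows if r["headingsCount"] <= 1]
    return ranked, thin
```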

Competitor intel

Organise crawls of different competitors and save them as separate Jobs. With SQL you can compare word counts, crawl depth and H1 distribution.

Content refresh radar

Schedule monthly crawls (using cron pattern) and compare historic exports. Track when titles or meta descriptions change and notify the team.
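Comparing two exports can be as simple as indexing each by URL and diffing selected fields. The sketch below assumes JSONL exports with one JSON object per line and field names `url`, `title` and `meta_description`; those names are placeholders and should be adjusted to the actual export.

```python
import json

def index_export(jsonl_text):
    """Index a JSONL export by URL (later lines win on duplicates)."""
    pages = {}
    for line in jsonl_text.splitlines():
        if line.strip():
            doc = json.loads(line)
            pages[doc["url"]] = doc
    return pages

def changed_fields(old_jsonl, new_jsonl, fields=("title", "meta_description")):
    """List (url, field, before, after) for pages present in both exports."""
    old, new = index_export(old_jsonl), index_export(new_jsonl)
    changes = []
    for url in sorted(old.keys() & new.keys()):
        for field in fields:
            before, after = old[url].get(field), new[url].get(field)
            if before != after:
                changes.append((url, field, before, after))
    return changes
```

Pages that exist in only one of the two exports are ignored here; listing them separately would be a natural extension.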

Media library audit

Filter documents for content_type LIKE '%image%' or url LIKE '%.pdf' to map heavy assets that require a CDN or compression.

Automation ideas

Hook the JSONL export into n8n or Zapier to push insights into Slack, Data Studio or Supabase dashboards automatically.