Exploring Job Application Automation with Data Scraping

Community Article · Published May 10, 2024

Preface

This is one of a series of posts/articles about projects I have built or things I am currently trying to learn.

Introduction

In today's competitive job market, efficiency is key, and automating repetitive tasks is one way to get there. Full automation may not be feasible, given varying application requirements and the need for human interaction, but partial automation can lighten the load. This article explores the technical side of automating job applications, focusing on the concepts rather than practical deployment, so that we stay within the terms of service of the websites involved.

What Is Web Scraping?

Web scraping is a technique for extracting information from websites. Using various programming languages and tools, it automates the process of collecting data from the web.
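As a minimal illustration of the idea, the short sketch below fetches a page and pulls one piece of data out of its HTML. It uses requests and lxml, both of which appear again later in this article; the URL is just a placeholder example.

import requests
from lxml import html

# Fetch a page and extract its top-level heading
page = requests.get("https://example.com")
tree = html.fromstring(page.content)
print(tree.xpath("//h1/text()"))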

The first step in any web scraping project is to import the necessary Python packages and set up a webdriver. The webdriver lets us control a web browser programmatically, so we can interact with web pages much like a human user would.

In this example we will use Selenium, a powerful tool for controlling web browsers and automating browser interactions from code. We will also use Pandas for data manipulation and time to add pauses between steps.


# Import packages
from selenium import webdriver # Web scraping library to control browser interaction
import pandas as pd 
from time import sleep
import requests
from lxml import html
import traceback

# The following imports are included but not used in this code snippet. 
# They can be useful for advanced functionalities like waiting for certain conditions.
# from selenium.webdriver.support.select import Select 
# from selenium.webdriver.support.ui import WebDriverWait 
# from selenium.webdriver.common.by import By 
# from selenium.webdriver.support import expected_conditions as EC 

# The import below is not used but can be important if you want to add Chrome options.
# from selenium.webdriver.chrome.options import Options 

# Path to webdriver, can be chrome or anything else you have
# Note: Replace the empty string with the path to your webdriver executable.
driver = webdriver.Chrome(executable_path="")

driver = webdriver.Chrome(executable_path="") initializes a new Chrome browser window controlled by the webdriver. Replace the empty string with the path to your ChromeDriver executable, and make sure the ChromeDriver version you download matches your installed Chrome browser.
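Note that the snippet above uses the older Selenium 3 style API. In Selenium 4 the executable_path argument was removed, so on a recent version the equivalent setup looks roughly like the sketch below; the driver path is still a placeholder for you to fill in, and recent versions can also locate a matching driver automatically if you omit it.

# Sketch: Selenium 4+ style initialization (assumes selenium>=4 is installed)
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(""))  # replace "" with your chromedriver path

Likewise, the find_element_by_* helpers used throughout this article were removed in Selenium 4; on a recent version you would call driver.find_element(By.ID, "username") instead.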

Navigating to LinkedIn and Logging In

With the packages imported and the webdriver set up, the next step is to navigate to LinkedIn's login page. We will enter the username and password programmatically and then click the "Sign in" button.

# Define the URL for LinkedIn login
url1='https://www.linkedin.com/login'

# Set an implicit wait of 1 second: element lookups will retry for up to 1 second before failing
driver.implicitly_wait(1)

# Navigate to the LinkedIn login page
driver.get(url1)

# Locate email field by its id and input email
email_field = driver.find_element_by_id('username')
email_field.send_keys('') # Replace empty quotes with your email
print('- Finish keying in email')
sleep(1) # Pause for 1 second

# Locate password field by its name attribute and input password
password_field = driver.find_element_by_name('session_password')
password_field.send_keys('') # Replace empty quotes with your password
print('- Finish keying in password')
sleep(1) # Pause for 1 second

# Locate the sign-in button by its XPath and click it
signin_field = driver.find_element_by_xpath('//*[@id="organic-div"]/form/div[3]/button')
signin_field.click()
sleep(1) # Pause for 1 second
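The fixed sleep() calls above work, but they either wait longer than necessary or not long enough on a slow connection. The WebDriverWait imports commented out earlier can be used for explicit waits instead; here is a rough sketch, reusing the same 'username' field as above.

# Sketch: wait up to 10 seconds for the username field instead of sleeping blindly
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
email_field = wait.until(EC.presence_of_element_located((By.ID, 'username')))
email_field.send_keys('')  # replace empty quotes with your email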

Getting the Names of the Companies You Want to Apply To

After logging in successfully, you will probably want a list of companies to target. For this example we will scrape a list of hedge fund managers from the SWFI (Sovereign Wealth Fund Institute) website and save the names to a CSV file.

# URL of the page with a list of companies you're interested in
url = "https://www.swfinstitute.org/fund-manager-rankings/hedge-fund-manager"

# Fetch the webpage
response = requests.get(url)

# Parse the HTML content
tree = html.fromstring(response.content)

# Extract company names using XPath
companies = tree.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "table-striped", " " ))]//a/text()')

# Save the list of companies to a CSV file
pd.DataFrame(companies, columns=["Company"]).to_csv("companies.csv", index=False)

# Read the list back into a DataFrame to confirm
df = pd.read_csv("companies.csv")
df.head()

Cleaning the Company Names and Finding Their LinkedIn Pages

Once we have the list of companies from the CSV file, the next goal is to search for them on LinkedIn. However, company names often carry suffixes such as "LLC" or "Inc." that can interfere with the search, so the first step is to clean the names.

After that, we programmatically open LinkedIn's company search page for each cleaned company name, scrape the company's LinkedIn URL, and store these URLs for later use.

def clean_company_name(company_name):
    # List of extra stuff we want to remove from the company name
    extra_stuff = ['LLC', 'LP', 'Inc.', 'Co.', 'Corp.', 'Ltd.', 'LLP', 'PLC', 'AG', 'AB', 'BV', 'GmbH']
    
    # Remove extra stuff from the name
    cleaned_name = ' '.join(word for word in company_name.split() if word not in extra_stuff)
    
    # Remove any commas
    cleaned_name = cleaned_name.replace(',', '')
    
    # Replace spaces with URL-encoded spaces (%20) for the URL
    cleaned_name = cleaned_name.replace(' ', '%20')
    
    return cleaned_name

# List to hold company LinkedIn URLs
company_links = []

# Iterate through DataFrame rows
for i, row in df.iterrows():
    # Clean the company name
    clean_name = clean_company_name(row['Company'])
    
    # Generate LinkedIn search URL for the cleaned company name
    company_search_page = f"https://www.linkedin.com/search/results/companies/?keywords={clean_name}"
    print(company_search_page)
    
    # Navigate to LinkedIn search page
    driver.get(company_search_page)
    sleep(2) # Wait 2 seconds for the page to load
    
    # Try to find the LinkedIn URL of the company
    try:
        # XPath for most common scenario
        element = driver.find_element_by_xpath('/html/body/div[5]/div[3]/div[2]/div/div[1]/main/div/div/div[2]/div/ul/li[1]/div/div/div[2]/div[1]/div[1]/div/span/span/a')
    except:
        try:
            # Backup XPath for less common scenario
            element = driver.find_element_by_xpath('/html/body/div[5]/div[3]/div[2]/div/div[1]/main/div/div/div[3]/div/ul/li/div/div/div[2]/div[1]/div[1]/div/span/span/a')
        except:
            # If both fail, skip to next iteration
            continue
            
    # Extract the LinkedIn URL
    link = element.get_attribute("href")
    
    # Append the URL to our list
    company_links.append(link)

After collecting the LinkedIn URLs of the companies you are interested in, it is a good idea to save the data locally. That way you can reuse the list without scraping LinkedIn again, saving both time and compute.

One way to save and load objects in Python is the pickle module, which lets you serialize (save) Python objects to a file and deserialize (load) them back.

import pickle

# Saving the list to a file
filename = "company_list.pkl"
# Uncomment the following lines to actually save the list
# with open(filename, "wb") as file:
#     pickle.dump(company_links, file)

# Loading the list from the file
with open(filename, "rb") as file:
    loaded_list = pickle.load(file)

# Printing the loaded list to confirm
print(loaded_list)

Collecting Job Posting URLs from the Company Pages

With the LinkedIn URLs of the companies in hand, the next step is to browse their job postings. We will navigate to each company's jobs page, scrape the URLs of the individual postings, and store them for later use.

Here is how to collect job posting URLs from a LinkedIn company page:

# List to hold job LinkedIn URLs
job_links = []

# Iterate through each company's LinkedIn URL in the loaded list
for company_url in loaded_list:
    # Navigate to the 'jobs' section of the company's LinkedIn page
    driver.get(company_url + "jobs")
    
    try:
        # Try to find the link to all job listings for the company
        element = driver.find_element_by_xpath('/html/body/div[5]/div[3]/div/div[2]/div/div[2]/main/div[2]/div/section/div/div/a')
    except:
        try:
            # If the above fails, try another XPath pattern
            element = driver.find_element_by_xpath('/html/body/div[4]/div[3]/div/div[2]/div/div[2]/main/div[2]/div/section/div/div/a')
        except:
            # If both XPaths fail, skip to the next company
            continue

    # Extract the URL of the page containing all job listings
    link = element.get_attribute("href")

    # Navigate to the extracted link
    driver.get(link)
    sleep(1)  # Wait for a second to ensure the page loads
    
    # Find the ul element containing the job listings
    ul_element = driver.find_element_by_xpath("/html/body/div[5]/div[3]/div[4]/div/div/main/div/div[1]/div/ul")
    
    # Extract all the li elements (each represents a job listing) within the ul
    li_elements = ul_element.find_elements_by_tag_name("li")

    # Iterate over each li element to extract job links
    for element in li_elements:
        link_element = element.find_elements_by_class_name("job-card-list__title")
        if link_element:
            job_links.append(link_element[0].get_attribute("href"))

# Filename to store the list of job URLs
filename = "job_links.pkl"

# Saving the list to a file (uncomment when ready to save)
# with open(filename, "wb") as file:
#     pickle.dump(job_links, file)

# Loading the list back from the saved file
with open(filename, "rb") as file:
    loaded_job_links = pickle.load(file)

# Printing the loaded list to ensure it loaded correctly
print(loaded_job_links)

Extracting Job Details

We have already compiled the lists of company and job URLs. The next key step is to extract the relevant details from the job postings themselves, including the job title, the required qualifications, and the hiring manager's name.


# Initialize an empty list to hold all the scraped job data
job_data = []

# Iterate over each job link that we've previously loaded
for job in loaded_job_links:
    try:
        # Navigate to the job URL
        driver.get(job)

        job_details = {}  # Initialize an empty dictionary to hold details for this job

        # Scraping different fields for job details
        try:
            job_details['job_title'] = driver.find_element_by_class_name("jobs-unified-top-card__job-title").text
        except Exception:
            print("Exception when fetching job title")
            print(traceback.format_exc())

        try:
            job_details['job_metadata'] = driver.find_element_by_class_name("jobs-unified-top-card__primary-description").text
        except Exception:
            print("Exception when fetching job metadata")
            print(traceback.format_exc())

        try:
            job_details['hiring_team_name'] = driver.find_element_by_class_name("jobs-poster__name").text
        except Exception:
            print("Exception when fetching hiring team name")
            print(traceback.format_exc())

        try:
            test = driver.find_element_by_class_name("hirer-card__hirer-information")
            a_element = test.find_element_by_xpath(".//a")
            job_details['hiring_team_link'] = a_element.get_attribute("href")
        except Exception:
            print("Exception when fetching hiring team link")
            print(traceback.format_exc())

        company_metadata = []
        try:
            for i in driver.find_elements_by_class_name("jobs-unified-top-card__job-insight"):
                company_metadata.append(i.text)
            job_details['company_metadata'] = company_metadata
        except Exception:
            print("Exception when fetching company metadata")
            print(traceback.format_exc())

        try:
            job_details['description'] = driver.find_element_by_class_name("jobs-description-content__text").text
        except Exception:
            print("Exception when fetching job description")
            print(traceback.format_exc())

        # Append the collected job details to the master list
        job_data.append(job_details)

    except Exception:
        print("Exception when processing job: ", job)
        print(traceback.format_exc())
# Convert the list of job details dictionaries into a DataFrame
df = pd.DataFrame(job_data)

# Add a column for the LinkedIn job links
df['link'] = loaded_job_links  # We already loaded these job links in a previous step

# Preview the DataFrame to check the data
df.head()

# OPTIONAL: If you've saved the data to a CSV and you're reading it back in
# df = pd.read_csv("final_output.csv", index_col=0)

# Clean up the 'description' column to replace newline characters with a space
df['description'] = df['description'].str.replace(r'\r?\n', ' ', regex=True)

Using Hugging Face Transformers for Text Analysis

Natural language processing (NLP) has made remarkable progress in recent years, largely thanks to the pioneering work on Transformers. Hugging Face, an AI research organization, has made these state-of-the-art machine learning models far more accessible to developers through its Transformers library. This open-source library provides a collection of pretrained models and pipelines for tasks such as text summarization, text generation, and translation.

Hugging Face Transformers is particularly useful in the context of our job application automation because it lets us understand and interact with job descriptions and requirements more effectively. With pretrained models we can automatically summarize lengthy job descriptions to grasp the essentials quickly, and even generate personalized cover letters or application snippets based on a posting.

Text Summarization with Hugging Face

Summarization is invaluable when you are facing a large pile of job descriptions that would take hours to read. With just a few lines of code you can summarize these descriptions and capture the most important content. We will use the "sshleifer/distilbart-cnn-12-6" model, which has been fine-tuned for summarization.

Text Generation with Hugging Face

Beyond summarization, text generation offers more creative ways to interact with the job data. For example, based on the skills required in a job description, you could generate a paragraph highlighting your matching skills and experience. For this we will use the "tiiuae/falcon-7b-instruct" model (the one loaded in the code below), an instruction-tuned model suited to generating human-like conversational text.

# Import the pipeline from the transformers library
from transformers import pipeline

# Initialize the summarization pipeline
get_completion = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

# Initialize the text-generation pipeline
pipe = pipeline("text-generation", model="tiiuae/falcon-7b-instruct")

Once the summarization pipeline is set up, the natural next step is to apply it to the job descriptions we already have. Summarizing each description makes it much easier to grasp the gist of every posting quickly.

The Summarization Function

Let's define a function called summarize that takes an input text and returns a summarized version of it. We also include exception handling so the function fails gracefully if it runs into problems.

def summarize(input):
    try:
        # Using the get_completion pipeline to summarize the input
        output = get_completion(input)
        return output[0]['summary_text']
    except ValueError:
        # Handle the ValueError here if it occurs
        pass

# Applying the summarize function to the 'description' column in our DataFrame
df['description_summary'] = df['description'].apply(summarize)

# Saving the DataFrame with the summarized descriptions
df.to_csv("summarized_description.csv")

Automating Cover Letters with Text Generation

With the job descriptions summarized, one key element remains: a tailored cover letter for each position. Writing a unique, compelling cover letter for every application is time-consuming, so wouldn't it be great to automate this step as well?

In this section we define a function that uses Hugging Face's text generation capabilities to produce a personalized cover letter for each job posting.

The Cover Letter Generation Function

We will define a function called `generate_cover_letter` that takes a row of the DataFrame, containing the summarized job description, the job title, and other metadata, and generates a cover letter tailored to that description.

Here is the code snippet for the function:

def generate_cover_letter(row):
    # Create a prompt incorporating company name, job title, and description summary
    # NOTE: this assumes the DataFrame has a 'Company' column; if yours does not,
    # derive the company name from another field (e.g. job_metadata) or merge it in first.
    prompt = f"Dear Hiring Team at {row['Company']},\n\nI am excited to apply for the {row['job_title']} position. I was particularly drawn to this role because {row['description_summary']}. Can you please generate a cover letter for me?"

    try:
        # Using the 'pipe' pipeline for text generation based on the prompt
        generated_text = pipe(prompt, max_length=500, do_sample=True, top_k=50)[0]['generated_text']
        
        # Extracting the generated cover letter from the generated_text
        start = generated_text.find("Dear Hiring Team")
        if start != -1:
            generated_cover_letter = generated_text[start:]
            return generated_cover_letter

    except Exception as e:
        print(f"An error occurred: {e}")
        return None

# Applying the generate_cover_letter function to each row in the DataFrame
df['generated_cover_letter'] = df.apply(generate_cover_letter, axis=1)

# Saving the DataFrame with generated cover letters
df.to_csv("with_generated_cover_letters.csv")

Update: Switching the Generation Step to Ollama

Since I first wrote this article, ollama has come out, and it is honestly probably the best tool for running LLM models locally, since you can call it directly from Python or from the terminal.

Let's start by reading back the CSV file from the previous step:

import ollama 
import pandas as pd

df = pd.read_csv("summarized_description.csv", index_col=0)


def compose_email(hiring_team_name, job_title, job_metadata, hiring_team_link, company_metadata, description, link, description_summary):
    """
    Generates a personalized email for a job application using provided details.

    Parameters:
    - hiring_team_name: The name of the person in the hiring team.
    - job_title: The title of the job.
    - job_metadata: Additional details about the job.
    - hiring_team_link: A link to learn more about the hiring team or the person.
    - company_metadata: Information about the company.
    - description: The full text of the job description.
    - link: A link to the job posting.
    - description_summary: A summary of the job description, emphasizing key responsibilities and qualifications.

    Returns:
    A string containing the personalized email.
    """
    email_body = f"""
Dear {hiring_team_name},

I am reaching out to express my genuine interest in the {job_title} position as advertised on {link}. With a solid background in {job_metadata}, I am enthusiastic about the opportunity to contribute to {company_metadata}, a company I greatly admire for its [insert reason based on company_metadata].

The role's focus on {description_summary} aligns perfectly with my professional skills and personal interests. I am confident that my experience in [mention relevant experience] makes me a strong candidate for this position. I am particularly excited about the chance to [mention a specific project or responsibility mentioned in the job description].

Here’s why I believe I am a good fit for the role:
- [Briefly highlight key qualifications and experiences relevant to the job description_summary]

I am very much looking forward to the possibility of discussing this exciting opportunity with you further. Please let me know if you need any more information or if there's a convenient time for us to discuss how I can contribute to your team.

Thank you for considering my application.

Best regards,
[Your Name]
"""

    return email_body.strip()
# Bring back first row for testing
row = df.iloc[0]
row
job_title                                            Investment Engineer
job_metadata           Bridgewater Associates · Westport, CT (On-site...
hiring_team_name                                         Jason Koulouras
hiring_team_link              https://www.linkedin.com/in/jasonkoulouras
description            About the job About Bridgewater  Bridgewater A...
company_metadata                                                     NaN
link                   https://www.linkedin.com/jobs/view/3659222868/...
description_summary     Investment Engineer's mission is to understan...
Name: 0, dtype: object
message = compose_email(**row)
message

model_input = {
    'role': 'system',  # or 'user', depending on the context
    'content': message
}
response = ollama.chat(model='llama2', messages=[model_input])
print(response['message']['content'])

You could create a class or a function to loop over the DataFrame and send an email to each hiring team if you would like to. The printed response looks something like this:
Dear Jason Koulouras,

I am writing to express my strong interest in the Investment Engineer position at Bridgewater Associates. I was drawn to this opportunity due to the company's reputation as a leader in the financial industry and its commitment to innovation and excellence. As an experienced investment professional with a solid background in [relevant field], I am confident that my skills and experience align perfectly with the requirements of this role.

From my research on Bridgewater Associates, I understand that the Investment Engineer's primary responsibility is to design and build the algorithms used to generate the company's views on markets and economies, as well as translate those views into portfolios and trade them. As someone who shares the company's passion for understanding how these systems work and using that knowledge to inform investment decisions, I am excited about the opportunity to contribute to this mission.

In my current role at [current company], I have gained valuable experience in [relevant skills or experiences]. These skills, combined with my ability to work well in a team and communicate complex ideas effectively, make me a strong candidate for this position. I am particularly enthusiastic about the opportunity to work on [specific project or responsibility mentioned in the job description] and contribute my expertise to help drive Bridgewater's success.

Thank you for considering my application. I would be thrilled to discuss how I can contribute to your team and help drive the company's continued growth and innovation. Please let me know if there is any additional information you need or if you would like to schedule a time to speak further.

Sincerely,
[Your Name]
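Following the suggestion above, a minimal sketch of that loop might look like the following. It only drafts the emails into a new column (nothing is sent anywhere), and it assumes the llama2 model has already been pulled locally with ollama.

# Sketch: draft an email for every job in the DataFrame (nothing is sent anywhere)
drafts = []
for _, row in df.iterrows():
    prompt = compose_email(**row)
    response = ollama.chat(model='llama2', messages=[{'role': 'user', 'content': prompt}])
    drafts.append(response['message']['content'])

df['draft_email'] = drafts
df.to_csv("draft_emails.csv")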

Note:

This is not perfect, and you should not automate this part of the application. The whole point of this project is to automate the data-gathering part of finding your ideal role; the application itself should always keep your personal touch, because that is the only thing that will make you stand out among thousands of resumes.

Conclusion

There is still a lot more you could do with the LLM part, such as filtering out jobs you are not interested in or crafting more polished emails. The sky (or your compute budget) is the limit.
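For instance, a rough sketch of such a filter could ask the local model a yes/no question about each job summary; the criterion in the prompt is only a placeholder, and the sketch assumes the same ollama setup and DataFrame as above.

# Sketch: keep only the jobs the model judges relevant (placeholder criterion)
def is_relevant(summary):
    question = f"Answer only 'yes' or 'no': is this job a good fit for a Python data engineer? {summary}"
    reply = ollama.chat(model='llama2', messages=[{'role': 'user', 'content': question}])
    return reply['message']['content'].strip().lower().startswith('yes')

df['relevant'] = df['description_summary'].apply(is_relevant)
df = df[df['relevant']]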

Automating job applications, mainly through data scraping and machine learning, can significantly streamline the process for both job seekers and recruiters. This tutorial demonstrated how to automatically collect job postings from LinkedIn and then use machine learning models, specifically the Transformers library from Hugging Face, to generate personalized cover letters.

We covered in detail how to set up a web scraper with Selenium, fetch the relevant job details, and use Hugging Face's NLP capabilities to generate content dynamically.

That said, applicants should review and fine-tune any generated cover letter to make sure it reflects their own skills, experience, and aspirations. Automation tools can speed up the process, but they cannot fully replace the human touch.

Finally, make sure you comply with the terms, conditions, and guidelines of any third-party service you use. This tutorial is purely educational and should not be taken as encouragement to violate such terms.

Good luck with your job search, and may your automation efforts help you land your dream job!

That wraps up this article. I hope you found this tutorial helpful. Thanks for reading!
