How to Scrape Twitter Data with Python and Selenium: A Complete Tutorial

Published: at 12:00 PM

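Introduction

Twitter (now X) is one of the world's largest social media platforms and carries a huge volume of real-time information. This article walks through how to automate Twitter data collection with Python and Selenium for research, analysis, or monitoring.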

Why scrape Twitter data?

By scraping Twitter data, we can:

Sample scraped data

Below is an example of a typical tweet data structure:

{
  "type": "tweet",
  "id": 1843447413824209160,
  "viewCount": "51275823",
  "url": "https://x.com/elonmusk/status/1843447413824209160",
  "twitterUrl": "https://twitter.com/elonmusk/status/1843447413824209160",
  "text": "It is a surefire way for the Dems to turn America in a one-party state, just like California",
  "isQuote": true,
  "retweetCount": 59493,
  "replyCount": 11090,
  "likeCount": 250068,
  "quoteCount": 1661,
  "createdAt": "Tue Oct 08 00:24:47 +0000 2024",
  "lang": "en",
  "quoteId": "1843379457605939258",
  "bookmarkCount": 11177,
  "isReply": false,
  "source": "Twitter for iPhone",
  "author": {
    "type": "user",
    "username": "elonmusk",
    "url": "https://x.com/elonmusk",
    "twitterUrl": "https://x.com/elonmusk",
    "id": "44196397",
    "name": "Elon Musk",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": "",
    "profilePicture": "https://pbs.twimg.com/profile_images/1849727333617573888/HBgPUrjG_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/44196397/1726163678",
    "description": "Read @America to understand why I’m supporting Trump for President",
    "location": "",
    "followers": 202400789,
    "following": 794,
    "protected": false,
    "status": "",
    "canDm": false,
    "canMediaTag": false,
    "createdAt": "Tue Jun 02 20:12:29 +0000 2009",
    "advertiserAccountType": "",
    "analyticsType": "",
    "entities": {
      "description": {
        "urls": [
          
        ]
      },
      "url": {
        "urls": [
          {
            "display_url": "TheAmericaPAC.org",
            "expanded_url": "http://TheAmericaPAC.org",
            "url": "https://t.co/DjyKIO6ePx",
            "indices": [
              0,
              23
            ]
          }
        ]
      }
    },
    "fastFollowersCount": 0,
    "favouritesCount": 83676,
    "geoEnabled": false,
    "hasCustomTimelines": true,
    "hasExtendedProfile": false,
    "isTranslator": false,
    "mediaCount": 2637,
    "profileBackgroundColor": "",
    "statusesCount": 55447,
    "translatorTypeEnum": "",
    "withheldInCountries": [
      
    ],
    "affiliatesHighlightedLabel": {
      "label": {
        "url": {
          "url": "https://twitter.com/X",
          "urlType": "DeepLink"
        },
        "badge": {
          "url": "https://pbs.twimg.com/profile_images/1683899100922511378/5lY42eHs_bigger.jpg"
        },
        "description": "X",
        "userLabelType": "BusinessLabel",
        "userLabelDisplayType": "Badge"
      }
    }
  },
  "quote": {
    "type": "tweet",
    "id": "1843379457605939258",
    "text": "Elon Musk explains how this will be our last real election if Kamala Harris wins.\n\nEveryone must watch this. https://t.co/DoBh9qM7K7",
    "retweetCount": 10725,
    "replyCount": 1848,
    "likeCount": 38268,
    "quoteCount": 790,
    "createdAt": "Mon Oct 07 19:54:45 +0000 2024",
    "lang": "en",
    "bookmarkCount": 5143,
    "author": {
      "type": "user",
      "username": "EndWokeness",
      "url": "https://x.com/EndWokeness",
      "twitterUrl": "https://x.com/EndWokeness",
      "id": "1552795969959636992",
      "name": "End Wokeness",
      "isVerified": false,
      "isBlueVerified": true,
      "verifiedType": "",
      "profilePicture": "https://pbs.twimg.com/profile_images/1563691268793946117/OedvhFeS_normal.jpg",
      "coverPicture": "https://pbs.twimg.com/profile_banners/1552795969959636992/1720913469",
      "description": "Fighting, exposing, and mocking wokeness. DM for submissions",
      "location": "",
      "followers": 3107102,
      "following": 1177,
      "protected": false,
      "status": "",
      "canDm": true,
      "canMediaTag": true,
      "createdAt": "Thu Jul 28 23:20:28 +0000 2022",
      "advertiserAccountType": "",
      "analyticsType": "",
      "entities": {
        "description": {
          "urls": [
            
          ]
        }
      },
      "fastFollowersCount": 0,
      "favouritesCount": 13138,
      "geoEnabled": false,
      "hasCustomTimelines": true,
      "hasExtendedProfile": false,
      "isTranslator": false,
      "mediaCount": 7219,
      "profileBackgroundColor": "",
      "statusesCount": 15502,
      "translatorTypeEnum": "",
      "withheldInCountries": [
        
      ],
      "affiliatesHighlightedLabel": {
        
      }
    }
  }
}
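A record like the one above is easier to analyze once it is flattened into a few typed fields. The following is a minimal sketch, assuming the field names shown in the sample; `summarize_tweet` and `CREATED_AT_FORMAT` are hypothetical names introduced for illustration. Because the sample stores `id` as a number at the top level but as a string inside `quote`, the sketch normalizes IDs to strings.

```python
from datetime import datetime

# Format of "createdAt" in the sample, e.g. "Tue Oct 08 00:24:47 +0000 2024".
# Note: %a and %b depend on the locale; an English/C locale is assumed.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def summarize_tweet(tweet):
    """Flatten one tweet record into a small dict of typed fields."""
    created = datetime.strptime(tweet["createdAt"], CREATED_AT_FORMAT)
    return {
        "id": str(tweet["id"]),                # number or string in the sample
        "author": tweet["author"]["username"],
        "created_at": created.isoformat(),
        "views": int(tweet["viewCount"]),      # stored as a string in the sample
        "likes": tweet["likeCount"],
        "retweets": tweet["retweetCount"],
        "is_quote": tweet["isQuote"],
    }
```

Calling `summarize_tweet(json.loads(text))` on the sample record would give one flat row that can be collected into a list and handed to pandas later.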

Complete implementation steps

1. Environment setup

First, install the required dependencies:

pip install selenium
pip install webdriver_manager
pip install pandas

2. Download ChromeDriver

Download the ChromeDriver build for your browser from Chrome for Testing. Make sure the ChromeDriver version matches the version of your Chrome browser.
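As a rough sanity check for the version requirement, the sketch below compares the major version numbers of two version strings. It assumes that a matching major version is the property worth checking; the helper names and the sample version strings in the usage note are illustrative only.

```python
def major_version(version):
    """Return the major component of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_match(chrome_version, driver_version):
    """True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)
```

For example, `versions_match("131.0.6778.85", "130.0.6723.116")` is false, which would signal that a different ChromeDriver build is needed.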

3. Start Chrome in debugging mode

Create a batch file that starts Chrome with remote debugging enabled:

@echo off
start C:\software\chrome-win64\chrome.exe --remote-debugging-port=9223
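Where a batch file is inconvenient, the same launch can be sketched from Python. The executable path and port are taken from the batch file above; `build_chrome_command`, `launch_chrome`, and `debugger_is_ready` are hypothetical helper names, and the readiness check assumes Chrome serves its DevTools HTTP endpoint `/json/version` on the debugging port.

```python
import subprocess
import urllib.request

CHROME_PATH = r"C:\software\chrome-win64\chrome.exe"  # path from the batch file
DEBUG_PORT = 9223                                      # port from the batch file


def build_chrome_command(chrome_path=CHROME_PATH, port=DEBUG_PORT):
    """Build the argument list that starts Chrome with remote debugging."""
    return [chrome_path, f"--remote-debugging-port={port}"]


def launch_chrome(chrome_path=CHROME_PATH, port=DEBUG_PORT):
    """Start Chrome as a child process and return the Popen handle."""
    return subprocess.Popen(build_chrome_command(chrome_path, port))


def debugger_is_ready(port=DEBUG_PORT, timeout=2.0):
    """Return True if the DevTools endpoint answers on the given port."""
    url = f"http://localhost:{port}/json/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Connection refused or timed out: Chrome is not listening yet.
        return False
```

A script could call `launch_chrome()` and then poll `debugger_is_ready()` before letting Selenium attach to the browser.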

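4. Configure Chrome options

from selenium import webdriver

def setup_chrome_options(self):
    # Method of the scraper class; the returned options are passed to webdriver.Chrome.
    options = webdriver.ChromeOptions()
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    options.add_argument(f'user-agent={user_agent}')
    options.add_argument('--disable-gpu')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    # Attach to the Chrome instance started in step 3 (same port as the batch file).
    options.add_experimental_option("debuggerAddress", "localhost:9223")
    # Step 6 reads get_log("performance"), which needs performance logging enabled.
    options.set_capability("goog:loggingPrefs", {"performance": "ALL"})
    return options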

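5. Implement the search function

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def search_tweets(self, search_query):
    """
    Run a search so that the matching tweet data is loaded.
    """
    # Let element lookups wait up to 20 seconds for the page to render.
    self.browser.implicitly_wait(20)
    self.browser.switch_to.new_window('tab')
    url = "https://x.com/explore"
    self.browser.get(url=url)

    # Locate the search box
    search_box = self.browser.find_element(
        By.CSS_SELECTOR,
        '[data-testid="SearchBox_Search_Input"]'
    )

    # Clear the search box and type the query
    search_box.send_keys(Keys.CONTROL + "a")
    search_box.send_keys(Keys.DELETE)
    search_box.send_keys(search_query)
    search_box.send_keys(Keys.RETURN)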

6. Monitor network responses

import json

def monitor_network(self):
    """
    Monitor the browser's network responses.
    """
    performance_log = self.browser.get_log("performance")
    for packet in performance_log:
        raw_message = packet.get("message")
        message = json.loads(raw_message).get("message")

        # Keep only Network events that mention the SearchTimeline API call.
        if "Network" in message.get("method") and 'SearchTimeline' in raw_message:
            document_url = message['params'].get('documentURL')
            if document_url and '&f=live' in document_url:
                request_id = message.get("params").get("requestId")
                # Process the request data
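One way to fill in the "process the request data" step is to collect the matching request IDs and then ask Chrome for each response body. The sketch below is an illustration under stated assumptions, not the author's implementation: `find_search_request_ids` and `fetch_response_json` are hypothetical helpers, `browser` is assumed to be a Selenium Chrome driver (which exposes `execute_cdp_cmd`), and the response is assumed to be JSON.

```python
import base64
import json


def find_search_request_ids(performance_log, url_marker="&f=live"):
    """Return request IDs of SearchTimeline events from a performance log."""
    request_ids = []
    for packet in performance_log:
        raw_message = packet.get("message", "")
        if "SearchTimeline" not in raw_message:
            continue
        message = json.loads(raw_message).get("message", {})
        if not message.get("method", "").startswith("Network."):
            continue
        params = message.get("params", {})
        # Same filter as the method above: only events from the live search page.
        if url_marker not in (params.get("documentURL") or ""):
            continue
        request_id = params.get("requestId")
        if request_id and request_id not in request_ids:
            request_ids.append(request_id)
    return request_ids


def fetch_response_json(browser, request_id):
    """Fetch one response body through the DevTools protocol and parse it."""
    result = browser.execute_cdp_cmd(
        "Network.getResponseBody", {"requestId": request_id}
    )
    text = result["body"]
    if result.get("base64Encoded"):
        text = base64.b64decode(text).decode("utf-8")
    return json.loads(text)
```

Keeping the log filtering separate from the browser call makes the filter easy to check against a saved log without a running browser.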

7. Data extraction and processing

def extract_tweet_data(self, entries):
    """
    Extract tweet data from the response.
    """
    tweets = []
    for entry in entries:
        item_content = entry['content'].get('itemContent')
        if not item_content:
            continue

        # Skip entries that carry no tweet result.
        tweet_result = item_content.get('tweet_results', {}).get('result')
        if not tweet_result:
            continue

        # Field names depend on the response schema and may need adjusting.
        tweets.append({
            'id': tweet_result.get('id_str'),
            'text': tweet_result.get('text'),
            'created_at': tweet_result.get('created_at'),
            'author': tweet_result.get('user'),
            'metrics': tweet_result.get('public_metrics')
        })
    return tweets
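Once tweets are extracted, they usually need to be deduplicated and stored. The sketch below assumes that repeated polls of the log can return the same tweet more than once and that each dict carries the `id` key produced above; `dedupe_tweets` and `to_json_lines` are hypothetical helpers, and JSON Lines is just one convenient storage format.

```python
import json


def dedupe_tweets(tweets):
    """Drop tweets whose id was already seen, keeping the first occurrence."""
    seen = set()
    unique = []
    for tweet in tweets:
        tweet_id = tweet.get("id")
        if tweet_id is not None and tweet_id in seen:
            continue
        if tweet_id is not None:
            seen.add(tweet_id)
        unique.append(tweet)  # tweets without an id are kept as-is
    return unique


def to_json_lines(tweets):
    """Serialize tweets as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(t, ensure_ascii=False) + "\n" for t in tweets)
```

Writing the result is then a single call such as `open("tweets.jsonl", "a", encoding="utf-8").write(to_json_lines(dedupe_tweets(tweets)))`, where the file name is only an example.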

Important notes

  1. Obtaining Twitter cookies

  2. Alternatives

  3. Usage limits

    • Comply with Twitter's terms of service
    • Respect request rate limits
    • Use proxy servers responsibly
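A simple way to respect rate limits is to pause between searches with a little random jitter. This is a minimal sketch: the delay values are placeholders rather than limits published by the platform, and `polite_delay` is a hypothetical helper. The `sleep` and `rng` parameters exist only so the function can be exercised without actually waiting.

```python
import random
import time


def polite_delay(base_seconds=2.0, jitter_seconds=1.0,
                 sleep=time.sleep, rng=random.uniform):
    """Sleep for base_seconds plus a random jitter and return the delay used."""
    delay = base_seconds + rng(0, jitter_seconds)
    sleep(delay)
    return delay
```

Calling `polite_delay()` between successive searches spaces the requests out; the base delay can be raised if the site starts rejecting requests.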

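Frequently asked questions

Q: Why Python and Selenium?

A: Python has rich library support, and Selenium simulates real browser behavior, so it can handle dynamically loaded content.

Q: What programming background do I need?

A: Basic Python knowledge; familiarity with HTML and CSS selectors helps.

Q: What can the data be used for?

A: It can be used for:

Q: How do I handle large amounts of data?

A: Suggestions:

Summary

Scraping Twitter data with Python and Selenium is a powerful way to automate the collection and analysis of social media data. With the approach described in this article, you can build your own data collection system for specific research or monitoring needs.

Remember to follow the platform's terms of service and to use these tools and data responsibly. If you need more support, you can join our discussion group.

Further resources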

