Crawling TikTok and YouTube Videos

Recently, our group needed to train a SORA-like text-to-video model, and a dataset is indispensable for that. Existing options such as CelebV-Text are fairly small in scale, so we decided to crawl videos ourselves and build a face dataset containing both text and video.

Approach 1: collect page URLs --> third-party URL-resolution service --> download videos

Approach 2: collect page URLs --> find a URL-resolution service on GitHub --> download videos

Crawling YouTube Videos

Crawling a single video

The idea is to parse the URL of the video playback page, obtain the direct URL of the video stream, and download from it.

We mainly use the pytube library to do this:

from pytube import YouTube
def video_downloader(video_url):
    # passing the url to the YouTube object
    my_video = YouTube(video_url)

    # downloading the video in high resolution
    my_video.streams.get_highest_resolution().download()
    # low resolution
    # my_video.streams.first().download() 
    return my_video.title
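
For example, with a valid watch-page URL (the video ID below is just a placeholder):

title = video_downloader('https://www.youtube.com/watch?v=XXXXXXXXXXX')
print(f'"{title}" downloaded successfully!')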

If we only want the audio track of a video:

from pytube import YouTube
import os

def audio_downloader(video_url):
    my_video = YouTube(video_url)
    # download() returns the path of the saved audio file
    my_audio = my_video.streams.get_audio_only().download()
    # swap the extension for .mp3 (this only renames the file, it does not transcode)
    base, ext = os.path.splitext(my_audio)
    new_file = base + '.mp3'
    os.rename(my_audio, new_file)
    return my_video.title
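
Note that os.rename only changes the extension; the data inside is still the original MP4/WebM audio container, so some players may complain. A minimal sketch of a real conversion, assuming the ffmpeg binary is installed and on PATH:

import os
import subprocess

def to_mp3(audio_path):
    # re-encode the downloaded audio into an actual MP3 file
    base, _ = os.path.splitext(audio_path)
    subprocess.run(['ffmpeg', '-i', audio_path, base + '.mp3'], check=True)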

Crawling multiple videos

We handled single-video crawling above, but building a dataset obviously requires crawling videos at scale. In theory, crawling only YouTube short videos together with their descriptions would be ideal, but limited by our technical skills we crawl long videos instead and clip them into segments in post-processing.
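
As a minimal sketch of that post-processing step (assuming ffmpeg is installed and the cut points are known), a segment can be extracted without re-encoding:

import subprocess

# cut a 5-second clip starting at 00:00:10, stream-copying without re-encoding
subprocess.run([
    'ffmpeg', '-ss', '00:00:10', '-t', '5',
    '-i', 'long_video.mp4', '-c', 'copy', 'clip.mp4',
], check=True)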

The current approach is to repeatedly refresh keyword search pages and use a set to track videos that have not yet been downloaded. We tried Google's YouTube Data API, but ran into issues that prevented us from using it.

import requests
import re

# the single-video helper defined in the previous section
# (saved in a local module of ours; use whatever file name you gave it)
from youtube_downloader import video_downloader

keys = ['girl', 'boy', 'man', 'woman', 'student']
video_ids_all = set()   # IDs we have already downloaded
count = 0
# refresh the search pages forever; stop manually once enough videos are collected
while True:
    for key in keys:
        res = requests.get(f"https://www.youtube.com/results?search_query={key}")
        # video IDs are embedded in the page source as "videoId": "..."
        pattern = r'"videoId":\s*"(.+?)"'
        video_ids = set(re.findall(pattern, res.text))
        for video_id in video_ids:
            if video_id not in video_ids_all:
                video_ids_all.add(video_id)
                video_url = 'https://www.youtube.com/watch?v=' + video_id
                try:
                    print('Downloading your video, please wait...')
                    video = video_downloader(video_url)
                    print(f'"{video}" downloaded successfully!')
                    count += 1
                except Exception:
                    print(f'Failed to download video {video_url}')
    print(f"downloaded videos so far: {count}")

One obvious improvement is that the keyword list keys could be composed of a large number of person-related terms, which would guarantee more diversity in the collected videos.
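
For instance (a sketch; the attribute word lists below are just illustrative), a much larger keyword list can be generated from combinations of person-related terms:

ages = ['young', 'old', 'middle-aged']
subjects = ['girl', 'boy', 'man', 'woman', 'student', 'teacher']
actions = ['talking', 'smiling', 'singing', 'walking']

# the cartesian product of the attribute words gives a much richer query list
keys = [f'{age} {subject} {action}'
        for age in ages for subject in subjects for action in actions]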

Another improvement: when a video fails to download, retry with a different method, namely the third-party resolver approach used for TikTok below.
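
A minimal sketch of that fallback, assuming the third-party downloader from the TikTok section is available as a local function tiktok_style_downloader(url) (that name is ours, not from any library):

def robust_download(video_url):
    # try pytube first, fall back to the third-party resolver on failure
    try:
        return video_downloader(video_url)
    except Exception:
        return tiktok_style_downloader(video_url)  # hypothetical wrapper around the resolver below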

Crawling TikTok Videos

Crawling a single video

None of the TikTok download projects we found online worked, and crawling is not our specialty, so we use a third-party site to resolve the TikTok video information and download from the resolved link.

import requests
import re
import os

save_dir_path = os.path.join(os.getcwd(), 'tiktok_videos')

def video_downloader(video_url, video_id, save_dir_path=save_dir_path):
    os.makedirs(save_dir_path, exist_ok=True)
    data = {
        'q': video_url,
        'lang': 'zh-cn'
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36'
    }
    # ask the third-party resolver for the download links
    res = requests.post('https://tiksave.io/api/ajaxSearch', data=data, headers=headers)
    pattern_title = r'<h3>(.*?)</h3>'
    pattern_video = r'onclick=\\"showAd\(\)\\" href=\\"([^"]+)"'

    # the third matched link is the HD download link; [:-1] drops a trailing escape character
    HD_link = re.findall(pattern_video, res.text)[2][:-1]
    HD_title = re.findall(pattern_title, res.text)[0]
    print(HD_title)

    # download the resolved video with a streaming GET request
    response = requests.get(HD_link, stream=True)
    if response.status_code == 200:
        with open(os.path.join(save_dir_path, video_id + '.mp4'), 'wb') as file:
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:  # skip keep-alive chunks
                    file.write(chunk)
        print('Video downloaded successfully!')
    else:
        print('Video download failed, status code:', response.status_code)

video_url = 'https://www.tiktok.com/@urwomanofgod/video/7313972004518923552'
video_id = str(1).zfill(8)  # zero-padded file name, e.g. '00000001'
video_downloader(video_url, video_id)

Things worth improving: the saved videos and their text metadata are not stored in a standardized format, and it would be better to query multiple third-party resolver sites instead of relying on one.
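
A rough sketch of the multi-resolver idea, assuming each resolver site is wrapped in a function with the same signature as video_downloader above (the resolver list below is a placeholder, not real endpoints):

def download_with_fallback(video_url, video_id, resolvers):
    # try each third-party resolver in turn until one succeeds
    for resolver in resolvers:
        try:
            resolver(video_url, video_id)
            return True
        except Exception as e:
            print(f'{resolver.__name__} failed: {e}')
    return False

# resolvers = [video_downloader, some_other_resolver]  # placeholder list
# download_with_fallback(video_url, video_id, resolvers)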

Crawling multiple videos

Because TikTok loads content lazily, the earlier trick of fetching the raw page source and extracting URLs with a regex no longer works. Instead, we use Selenium to automate a browser and collect the video links.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time 
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import re
service = Service(executable_path='./chromedriver.exe')

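# cookie string copied from a logged-in TikTok session via the browser DevTools;
# these tokens expire, so paste your own fresh value here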
cookie_data = 'tt_csrf_token=YxzTUcrt-fkoGIPnPyCJeTIv6nf0wlJ95Aco; tiktok_webapp_theme=light; passport_csrf_token=72efd53bfbf54f699431cd01f62659de; passport_csrf_token_default=72efd53bfbf54f699431cd01f62659de; s_v_web_id=verify_lti2x8hl_zxICnl1I_yBPa_4vK2_Ab6X_FUwIug8Ope1k; store-country-code-src=uid; last_login_method=handle; csrfToken=pEdHeoHu-lvHlQ_1iJCXUYce1XnBcukUuung; csrf_session_id=68d76ff723efaf7ba0c6c518684a44b9; passport_auth_status=a3096af65e8f030518f134f88e0d75b6%2C0a7c75e3e51e66f57fee4417f3f5d226; passport_auth_status_ss=a3096af65e8f030518f134f88e0d75b6%2C0a7c75e3e51e66f57fee4417f3f5d226; multi_sids=; odin_tt=21d2894c1c62e3935fb59f9110a9163747d842d1414b55b628bc4f4dee3c5f651a86fcbeaa108ba27ea1b7761310f9535de90d3ccfcfeb4d92a98a83a1af79ea08d97c894d189f1776c99e465d3dcc94; tt_chain_token=vJ+yHgdL/4zBVN7kLZcmFg==; ak_bmsc=F1D9F4D5C0BB2EC66EF982452CD085B2~000000000000000000000000000000~YAAQdFA7F/u/uCyOAQAAfQVPLRciZ5rXsHszU9tYMwO4ERwMjROkUmt3t5U36NaauibDq3zWYxk7HnZA3v4IuS1JHU13G4l8sy4XQpQTweHeDD/TFPQII/kIisFvrMJlO6zlLXRUdWMmt/2eV08uuWDU3KsFvtE6OPAZc/Od8ap4dMu0MHdPzSIRsvKlG0wxmplOmM4Z1TWBEPGEp3xEhc8oN19jbHm06MXuMw9dQnGDqVWiVuJNBql50oPSPHk7hOWhV2tlX/q/ACWAORuJHljWL+0FcBS8yn/cbYEfO1L9o9U1ePhkAIwTZz2OpN4jnZFeJzhfGeEEVJvx+sfkMZhn1c2uIjUgiotEwNhSAS7/RA8Z0lG66dTpy4Uw0YawKRFzKFEYDvf9D68=; ttwid=1%7CJNhvSzImQhCSwauiGQajmaN9mlIIl9K-G6ZSWKQZoTE%7C1710161056%7Ce1d2de87a8b6e87fb2e2d7d0b8fbd8bb68b4de223da3ca90917bad7304bcb933; perf_feed_cache={%22expireTimestamp%22:1710331200000%2C%22itemIds%22:[%227344764894887038226%22%2C%227343235612687977733%22%2C%227340208628512738566%22]}; msToken=TwnLMYVV1heZ5pWqA8hExGgXyP_I4ogLMILJENTX1clBhPQJJtbflmqWbiiEQlC82v_xJkVGm2X88RckZcBs0_rRxVPAqDCxPXTxnCAZp2-fBH0UPnZUZUPtPakT0I3huDQHDMQHRDN_7Qh2VQwO; msToken=TwnLMYVV1heZ5pWqA8hExGgXyP_I4ogLMILJENTX1clBhPQJJtbflmqWbiiEQlC82v_xJkVGm2X88RckZcBs0_rRxVPAqDCxPXTxnCAZp2-fBH0UPnZUZUPtPakT0I3huDQHDMQHRDN_7Qh2VQwO; bm_sv=4A4E4272230D20D186D0FA9484BA7631~YAAQxVA7Fzj7LieOAQAAlQeLLRcHeJqiuT5QitIKHnujFgBhHM7xCBlWIpsSnEfiyK4GcHCE4EneLb5/SgkJEm/fkGHNRGEm108BLMJm7yffGBPXwdmlPwugL3GRZWS4oe0XOVrIyjYbal4e5oICVLfAxIJSDgDKONG/JRTXnEvCz9OeKxFGtAxEruS3ydy1FFa/a2lnw4RTEb4up+ANGRyN7rTL9R3Yf7E54hedE8kPmTikMpIc0atP5iIo3f9zag==~1'
# parse the raw cookie header into a name -> value dict
pattern = r'(?P<key>[^=;,]+)=(?P<value>[^;,]+)'
cookies = {}
for match in re.finditer(pattern, cookie_data):
    cookies[match.group('key').strip()] = match.group('value').strip()

# launch Chrome and open tiktok.com first, so cookies can be attached to its domain
bro = webdriver.Chrome(service=service)
bro.get('https://www.tiktok.com')
for name, value in cookies.items():
    # add other cookie fields ('path', 'domain', 'secure', ...) here if needed
    bro.add_cookie({'name': name, 'value': value})
# open the search results page for our keyword
bro.get('https://www.tiktok.com/search?q=girl')

time.sleep(10)

scrolls = 10
all_links = set()
for _ in range(scrolls):
    # wait until the search-result cards are present
    # (these CSS class names are TikTok-specific and change frequently)
    WebDriverWait(bro, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "css-1soki6-DivItemContainerForSearch"))
    )
    elements = bro.find_elements(By.CLASS_NAME, "css-1soki6-DivItemContainerForSearch")
    for element in elements:
        for child in element.find_elements(By.CLASS_NAME, "css-1as5cen-DivWrapper"):
            link = child.find_element(By.TAG_NAME, "a")   # the <a> tag inside the card
            href = link.get_attribute("href")             # the video page URL
            all_links.add(href)
            print(href)
    # press "End" to scroll to the bottom and trigger lazy loading
    bro.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)

    # wait for new content to load; adjust to the page's loading speed
    time.sleep(10)

# close the browser
bro.quit()
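
The collected page URLs can then be handed to the single-video downloader from the previous section. A minimal sketch, assuming that function is in scope here and files are simply numbered sequentially:

for i, href in enumerate(sorted(all_links), start=1):
    try:
        video_downloader(href, str(i).zfill(8))
    except Exception as e:
        print(f'Failed on {href}: {e}')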

There is still plenty to optimize; I will keep improving it when I have time.

Summary

Web scraping is a technique I first touched in my freshman year, yet I am still stuck at the level of merely using existing tools, which is embarrassing. When I have time I will definitely learn how video URL resolution actually works.