python GitHub data (code snippet)

author author     2022-12-26     328

import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import requests
import datetime
import os
import gzip
import json  # needed by json.loads in get_data
from joblib import Parallel, delayed


def no_unicode(df):
    # can't store python object in parquet files
    types = df.apply(lambda x: pd.api.types.infer_dtype(x.values))
    if len(types) > 0:
        # python 2 check 
        for col in types[types == 'unicode'].index:
            df[col] = df[col].astype(str)
        for col in types[types == 'mixed'].index:
            df[col] = df[col].astype(str)
    return df

def get_hours(last_date):
    """ Returns number of hours (number of files to download)."""
    diff = datetime.datetime.now() - last_date
    days, seconds = diff.days, diff.seconds
    hours = days * 24 + seconds // 3600
    return hours

def get_data(i, last_date=datetime.datetime(2017,1,1, 1)):
    "Update parquet directory with most recent github data."
    date = last_date + datetime.timedelta(hours=i)
    datestring = f'{date.year}-{date.month:02}-{date.day:02}-{date.hour}'
    url = f'http://data.githubarchive.org/{datestring}.json.gz'
    r = requests.get(url)
    # write request to disk
    filename = f'{datestring}.json.gz'
    with open(filename, 'wb') as f:
        f.write(r.content)
    # parse compressed file into json    
    lines = []
    for line in gzip.open(filename, 'rb'):
        lines.append(json.loads(line))
    # store as parquet dataframe
    df = pd.DataFrame(lines)[['id', 'actor', 'created_at', 'repo', 'type']]
    df = no_unicode(df)
    df = df.set_index('id')
    df.to_parquet('parquet/%s.parquet' % filename.split('.json')[0])
    # cleanup
    os.remove(filename)
    
def update_data():
    "Download all the things."
    dates = [file.split('.')[0] for file in os.listdir('parquet')]
    dates = [datetime.datetime.strptime(date, '%Y-%m-%d-%H') for date in dates]
    last_date = max(dates)
    # waiting sucks, let's try and speed some stuff up
    Parallel(n_jobs=10)(delayed(get_data)(i, last_date) for i in range(get_hours(last_date)))
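
As a rough illustration of how `get_hours` and `get_data` fit together, the sketch below lists the hourly file stems that would be requested for a given time window. The helper name `hourly_stems` and its explicit `now` argument are assumptions added for illustration (the snippet above reads the clock with `datetime.datetime.now()` instead), so treat this as a sketch rather than part of the original script.

```python
import datetime


def hourly_stems(last_date, now):
    """List archive file stems for each whole hour between last_date and now.

    Hypothetical helper: mirrors the arithmetic of get_hours and the
    date formatting of get_data, but performs no download.
    """
    diff = now - last_date
    # Same arithmetic as get_hours: whole hours contained in the interval.
    hours = diff.days * 24 + diff.seconds // 3600
    stems = []
    for i in range(hours):
        date = last_date + datetime.timedelta(hours=i)
        # Month and day are zero-padded, the hour is not (e.g. 2017-01-01-1).
        stems.append(f'{date.year}-{date.month:02}-{date.day:02}-{date.hour}')
    return stems


stems = hourly_stems(datetime.datetime(2017, 1, 1, 1),
                     datetime.datetime(2017, 1, 1, 4, 30))
print(stems)
```

Because the loop starts at `i = 0`, the first stem is the hour of `last_date` itself. In `update_data` this means the newest file already present in `parquet/` is fetched again on every run.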
